CONCERNING THE LENGTH OF NEW-TYPE
EXAMINATIONS*


CHARLES BIRD 


University of Minnesota 
AND 
DOROTHY M. ANDREW 


Pennsylvania College for Women 


The economy of time and effort attributed as an advantage of new- 
type examinations by writers of manuals has not been enjoyed by all 
instructors making serious efforts to escape limitations inherent in 
essay examinations. There is evidence that the constant demand for 
new items is time consuming. A frequent change of items is made 
imperative by new textbooks, and more especially by the necessity of 
keeping the new-type examination a measure of course content rather 
than an index of ability to memorize particular sets of questions not 
officially open to inspection. Not even access to files of questions, by 
the instructor, obviates careful selection and the balancing of items to 
sample the course adequately. That this statement carries little 
weight bears mute testimony to the ease with which enthusiasms dis- 
tort facts. He who keeps a record of time spent selecting questions 
from files will agree that the process may be simple, because of knowl- 
edge of subject-matter, while still being costly of time. Such a record 
was kept by the writers in the year 1934. It was necessary to assemble 
two final examinations, each one having one hundred fifty items. 
There were available forty-five new items for each test so that two 
hundred ten items had to be taken from files. The selection of one 
hundred five items and their integration into an examination not having
overlapping answers required nineteen hours and fifty minutes. For the
second form, practice effects, or perhaps ennui making for less care,
reduced the selection and organization of the questions to eighteen
hours. These records are not arguments against new-type examinations;
they are warnings that we should not expect too great a saving of time.

* The writers acknowledge their indebtedness to the National Youth
Administration for the services of Federal Aid Students in validating test
items and rescoring the 1935 examinations given in the Psychology
Department at the University of Minnesota.

Let us look more intimately into the preparation of new-type tests. 
Two final examinations must be prepared for a class of eight hundred 
fifty students in General Psychology. Since the group is divided into 
two sections, each having a different hour of the day for classes, the 
examinations can not be given simultaneously. It is necessary to 
prevent either a paper from becoming the illegal property of members 
of the class taking the examination two days after the first group, or 
knowledge of specific items becoming known through one of the many 
devices available to students. So our objective is to prepare for each 
examination forty-five analogy questions, forty-five single choice 
questions, thirty single word completion questions and thirty wrong 
word answer questions.* How long does it require to attain the objec- 
tive? After spending thirty-three hours and thirty minutes, eleven 
graduate students have contributed one hundred thirty-two questions. 
The questions are reviewed carefully by a committee of two faculty 
members and two graduate assistants, who combining their efforts for 
a three-hour session retain ninety questions. These ninety questions 
have required, therefore, a total of forty-five hours and thirty minutes 
from faculty members and graduate students. To complete the 
examinations, by resorting to files, demands an additional thirty-seven 
hours and fifty minutes, or expressed as clock hours of work, the tests 
have consumed eighty-three hours and twenty minutes. This time is 
not quite correct, for the two hundred ten items selected from the files 
had a history similar to that of the new questions. They required 
initially the combined efforts of many workers, so that each question 
represents probably an investment of thirty minutes. Were it possible 
to select questions always from a file, that is, if all factors threatening 
these new-type tests as measures of student achievement could be 
removed, then considerable time would be saved. We recall that the
ninety new questions consumed forty-five hours and thirty minutes,
whereas the selection of two hundred ten items and their arrangement
into two complete examinations required almost thirty-eight hours. It
is to be feared that conditions permitting the frequent use of old
questions will rarely occur to lighten the load of instructors.

* For descriptions of the first three types of questions see: Paterson, D. G.:
Preparation and Use of New-type Examinations. Chicago: World Book Co., 1927,
87 pp.; and for examples of the last named questions see: Holmes, G. and
Heidbreder, E.: "A statistical study of a new type of objective examination
question." J. of Educ. Res., Vol. XXIV, 1931, pp. 286-292.

It may serve future investigations if the total time required to 
prepare and score these two tests is recorded. The cutting of keys for 
scoring required twelve hours of work by a graduate student, and the 
scoring of the tests demanded one hundred and fourteen hours. Most 
of the scoring was done by graduate students and faculty members. 
Thus, a simple process of counting indicates that the total hours 
expended directly in the preparing and scoring of two new-type tests, 
given to about eight hundred fifty students, number two hundred nine 
hours and twenty minutes. Whether essay examinations would have 
required more time, or would have been as reliable and valid measures 
of learning is not our problem. It is a problem important enough to 
deserve investigation. Numerical data should supplant facile assump- 
tions whenever arguments about the relative merits of new-type and 
essay examinations are involved. We are concerned, however, with 
effects upon student grades resulting from reduction in length of new- 
type examinations. 
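
The time accounting in the preceding paragraphs can be checked with a short tally. The sketch below is only an illustration of the arithmetic: the helper names are hypothetical, and it assumes that the committee's three-hour session is counted once for each of its four members, which is the reading that reproduces the forty-five hours and thirty minutes reported above.

```python
from datetime import timedelta


def hm(hours, minutes=0):
    """Build a duration from hours and minutes."""
    return timedelta(hours=hours, minutes=minutes)


def fmt(duration):
    """Render a duration as 'H h MM min'."""
    total_minutes = int(duration.total_seconds() // 60)
    return f"{total_minutes // 60} h {total_minutes % 60:02d} min"


# Figures reported in the text.
writing_new_items = hm(33, 30)      # eleven graduate students, 132 drafts
committee_review = 4 * hm(3)        # assumption: four reviewers x three hours
new_items_total = writing_new_items + committee_review

selection_from_files = hm(19, 50) + hm(18)   # first form plus second form
preparation_total = new_items_total + selection_from_files

key_cutting = hm(12)
scoring = hm(114)
grand_total = preparation_total + key_cutting + scoring

# Average cost of one retained new question, in minutes (90 were retained).
minutes_per_new_item = new_items_total.total_seconds() / 60 / 90

print("New items:           ", fmt(new_items_total))
print("Selection from files:", fmt(selection_from_files))
print("Preparation in all:  ", fmt(preparation_total))
print("Preparing and scoring:", fmt(grand_total))
print("Minutes per new item: ", round(minutes_per_new_item, 1))
```

On these assumptions the tally agrees with the totals stated in the text, and the cost of a retained new question comes to about half an hour, the figure the writers suggest for each filed question.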

The paucity of inquiries into the length of new-type examinations 
possibly results from the uncritical acceptance among college teachers 
of units of time rather than of the number and character of items as a 
measure of length. Old standards of a "one hour" test or a "two
hour" final examination were adopted without reflection. Having
done this, we then tried to devise that number of new-type questions 
which would keep students busy for the one- or two-hour period. 
Gradually many instructors reduced the number of items under the 
stress of more urgent duties. How far can the reduction go? For- 
tunately there is a growing concern about the structure and function of 
new-type questions, which, if it continues, or, as is characteristic of 
movements in educational measurement, if it accelerates, will produce 
a marked improvement in techniques. But greater precision of con- 
cept and of description is essential before inquiries have unequivocal 
utility. Is it not futile to write of one-hour or of two-hour "objective"
tests without specifying the number and the kind of items used?
Recall questions demand more time than recognition questions and 
certain kinds of questions are more valid than others (2). When, 
therefore, we read, "It is probable that a two-hour objective test is
sufficiently comprehensive so that lengthening the test results in little
gain in reliability" (1, p. 359) we feel that significant generalization is
impossible since neither the number nor the character of the items in 
the examination is known. 

That the reliability of new-type tests can be increased by pooling 
short tests and that examinations can be reduced in number without 
seriously interfering with their usefulness seems clear from the investi- 
gation of Turney (4). This investigator states the number and kinds 
of items used. He concludes, "It would seem to be fairly practical
to use either the cumulative score on the short test or the final alone as
a basis for grades," and again, "There seems to be no very great reason
why a single final examination might not be sufficient except for the
psychological effect upon the student" (4, p. 295). A similar note is
struck by Lee (8, p. 167) when he states that the median number of
objective questions included by about sixteen hundred secondary-school
teachers in their objective tests is thirty-one. This number of
items, Lee believes, is too short to be valid, but he assumes that objec- 
tive tests can be made sufficiently valid if scores from a number of such 
tests are combined. How many test items need to be combined is not 
indicated. Parenthetically, may it be suggested that the relatively 
small median number of items bears silent testimony to the difficulty 
and the long time required for teachers to prepare by themselves new- 
type tests. They should be encouraged to retain all old items for 
future use. 

In the present inquiry into the desirability of reducing the number 
of items in final examinations from one hundred fifty to one hundred 
items, we shall be concerned primarily with the practical bearing of 
such reduction upon course grades of students. Incidentally, changes 
in reliability can be noted and contrasted with statistically predictable 
indices, but the major emphasis is upon the validity of the tests. Of 
considerable consequence is the matter of raising a burden from 
examination-ridden teachers. The continuation of new-type tech- 
niques should hinge upon the demonstration of their economy as well as 
upon their superiority from the standpoint of internal consistency and 
validity. 

At the University of Minnesota, the grade in the general psychology 
course reported each quarter is based upon student performance in 
three new-type tests. Because this inquiry is based upon results 
obtained during the two years, 1934 and 1935, and because changes in 
the character of the examinations occurred, it seems desirable to resort
to Table I. The shorter tests are popularly referred to as one-hour
tests and the longer ones are assumed to be two-hour tests. Approximately
three and a half weeks separate successive examinations. Our
analysis involved the final examinations primarily but it has been 
necessary to make reference to the short tests because they contribute 
to the final grade in the course. 


TABLE I. THE STRUCTURE OF THE 1934 AND 1935 EXAMINATIONS AND THE RANGE
OF SCORES ON THESE TESTS

        Length of          Kind and number of questions
Year    examination    ------------------------------------    Range of
        in items       S.C.*    An.    S.W.C.    W.W.A.        scores
1934         65          25      15      10        15          20-57
             65          25      15      10        15          20-59
            150          45      45      30        30          65-136
1935         75          45     ...      30       ...          22-71
             75          45     ...      30       ...          24-68
            150          75     ...      75       ...          43-130
            150          75     ...      75       ...          36-131

* Abbreviations indicate the following kinds of questions: Single Choice,
Analogy, Single-word-completion, and Wrong-word-answer.


To reduce the final examinations from one hundred fifty items to 
one hundred items, every third item was eliminated in the process of 
re-scoring. This technique was used to avoid the sampling of a small 
segment of the course. New letter grades were assigned to the papers. 
In assigning grades, the writers adhered as closely as possible to the 
proportions of letter grades A, B, C, D and F, prevailing in the actual 
final examination. That distributions defy exact duplications of 
percentages is a well-known fact, but it is one which needs to be 
remembered when questions arise relating to changes in grades. A 
similar procedure was followed in computing the course grade. The 
shorter test was substituted for the longer one and the scores were 
combined with those resulting from the two one-hour tests. 
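
The shortening procedure just described may be sketched in a few lines. This is an illustration only: the function names are hypothetical, and it assumes that the items removed were those numbered 3, 6, 9, and so on, since the text states merely that every third item was eliminated.

```python
def kept_positions(n_items, step=3):
    """Item numbers (1-based) that survive when every `step`-th item is dropped."""
    return [number for number in range(1, n_items + 1) if number % step != 0]


def shortened_score(item_marks, step=3):
    """Re-score one paper on the shortened test.

    item_marks: sequence of 1 (correct) or 0 (incorrect), in item order.
    """
    return sum(
        mark
        for number, mark in enumerate(item_marks, start=1)
        if number % step != 0
    )


# A one hundred fifty item test keeps one hundred items.
print(len(kept_positions(150)))

# A hypothetical paper that missed only items 3 and 4.
paper = [1] * 150
paper[2] = 0   # item 3, dropped in the shortened test
paper[3] = 0   # item 4, retained in the shortened test
print(sum(paper), shortened_score(paper))
```

Because the retained items are spread evenly through the paper, every part of the course sampled by the long form remains represented in the short one.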

The effect of reducing the length of the examination upon the reli- 
ability of the tests is indicated in Table II. For the year 1935, it needs 
to be pointed out that forms A-B contained identical questions but in 
different arrangements, and that forms C-D, although arranged like 
forms A-B, were entirely different in content. Later, comparable 
forms are treated as one examination. Apart from calling attention to
the fact that the reliability coefficients are lower, as is to be expected
when the test is reduced in size, the only comment seemingly necessary 
is that these coefficients are not strikingly different from many already 
reported in the literature where new-type tests are being considered. 


TABLE II. RELIABILITY (ODD-EVEN) OF THE NEW-TYPE TESTS

                            One hundred fifty        One hundred
                  Number    item examination       item examination
Year    Form    of papers      r       S.B.*          r       S.B.
1934    ...        382    [illegible]   .87          .71      .83
1935     A         222        .81       .90          .71      .83
1935     B         212        .84       .91          .76      .86
1935     C         196        .78       .88          .62      .77
1935     D         195        .86       .93          .71      .83

* Coefficients calculated by use of the Spearman-Brown Prophecy Formula.
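
The Spearman-Brown Prophecy Formula referred to in Table II has the standard form r_k = k r / (1 + (k - 1) r), where k is the ratio of the new length to the old. The sketch below applies it in two ways, using the Form A figures for 1935 as an example: stepping up an odd-even correlation to full length (k = 2), and predicting the reliability expected when one hundred fifty items are cut to one hundred (k = 2/3). The function name is an assumption of this illustration, and the prediction is offered only as the kind of "statistically predictable index" the text mentions, not as a computation reported by the writers.

```python
def spearman_brown(r, k):
    """Reliability of a test lengthened (k > 1) or shortened (k < 1) by factor k."""
    return k * r / (1 + (k - 1) * r)


# Odd-even correlation of .81 stepped up to the full one hundred fifty items.
stepped_up = spearman_brown(0.81, 2)

# Reliability predicted for one hundred items from a full-length value of .90.
predicted_short = spearman_brown(0.90, 100 / 150)

print(round(stepped_up, 3))
print(round(predicted_short, 3))
```

The first value rounds to the .90 entered in the table; the second, about .86, may be set beside the .83 actually observed for the shortened Form A.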


It might be profitable to examine the extent to which the separate
items in the 1935 examinations differentiated students receiving grades
ranging from A to F. We tabulated the percentage of students with
each of the five possible course grades passing or failing each item.
Those items showing decreasing percentages passing from the A grade
group through the F grade group were tabulated as discriminating
perfectly. An item which was passed by a higher percentage of a
lower grade group than of the immediately higher grade group (for
example, a higher percentage of "F" students than of "D" students
passing an item) was put in the category of one grade displacement. If
more than one inversion occurred, the criterion of validity used
was whether or not the question differentiated the combined A-B grade
students from the D-F students; failing this, the item was rated as
having no power of discrimination. The discriminating value of the
1935 examinations is represented in Table III.


TABLE III. VALIDITY OF ITEMS USED IN TWO SEPARATE FORMS OF THE 1935
FINAL EXAMINATION

                                Form A, kind of question      Form D, kind of question
                                 Single      Single-word-      Single      Single-word-
Degree of discrimination         choice       completion       choice       completion
                               Num-   Per    Num-   Per      Num-   Per    Num-   Per
                               ber    cent   ber    cent     ber    cent   ber    cent
Perfect discrimination          19    25.4    38    50.6      19    25.4    39    52.0
One letter grade displacement   24    32.0    15    20.0      20    26.7    22    29.4
A-B from D-F students            9    12.0    12    16.0       8    10.7     5     6.6
No discrimination               23    30.6    10    13.4      28    37.2     9    12.0
Total                           75   100.0    75   100.0      75   100.0    75   100.0
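
The four categories of discrimination can be expressed as a small decision rule. The following is a sketch under stated assumptions and not the writers' own tabulating procedure: the function and argument names are hypothetical, equal percentages in adjacent grade groups are assumed not to count as an inversion, and the combined A-B and D-F percentages are assumed to be weighted by the number of students in each group.

```python
GRADES = ("A", "B", "C", "D", "F")


def classify_item(pct_passing, group_sizes):
    """Assign one item to a discrimination category.

    pct_passing: per cent of each grade group passing the item, keyed by grade.
    group_sizes: number of students in each grade group, keyed by grade.
    """
    pcts = [pct_passing[grade] for grade in GRADES]
    # An inversion: a lower grade group passes more often than the next higher one.
    inversions = sum(1 for higher, lower in zip(pcts, pcts[1:]) if lower > higher)

    if inversions == 0:
        return "perfect discrimination"
    if inversions == 1:
        return "one letter grade displacement"

    def pooled(grades):
        passed = sum(pct_passing[g] * group_sizes[g] for g in grades)
        return passed / sum(group_sizes[g] for g in grades)

    if pooled(("A", "B")) > pooled(("D", "F")):
        return "A-B from D-F students"
    return "no discrimination"


# Hypothetical item: F students pass slightly more often than D students.
sizes = {"A": 40, "B": 90, "C": 180, "D": 90, "F": 34}
item = {"A": 92.0, "B": 85.0, "C": 71.0, "D": 55.0, "F": 58.0}
print(classify_item(item, sizes))
```

The example item shows a single inversion, between the D and F groups, and so falls in the category of one letter grade displacement.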

That many items failed to differentiate students making high scores
in the tests from those making low scores, after great care and much
time had been spent preparing questions, is a point which lends
emphasis to repeated warnings that economy of scoring is not to be
confused with economy of preparation. Valid questions, or good
questions, are achievements attained only after much practice and after
actually trying out the items. The inequality of discrimination among
types of questions is of interest, but this matter will be considered
separately for approximately two thousand items in a forthcoming
publication.


TABLE IV. CORRELATIONS AMONG FINAL EXAMINATIONS OF UNEQUAL LENGTH
AND THE AVERAGE SCORE IN TWO ONE-HOUR EXAMINATIONS

                  Number    One hundred fifty    One hundred
Year    Form    of papers      item final         item final
                                   r                  r
1934    ...        382            .79                .80
1935    A-B        434            .85                .81
1935    C-D        391            .78                .76

Another index of validity is the degree to which the long or short 
final examinations correlate with the combined scores of the two one- 
hour tests. These indices are represented in Table IV. They indicate 
as close agreement among the one hundred item final examinations and 
the average of the pre-final tests as exists among the one hundred fifty 
item examinations and the pre-final tests. In all probability, the vari- 
ables operating in our examinations are many, so that the differences in 
reliability gained by increasing the length of examinations beyond the 
one hundred item limit appear to be too slight to produce greater 
agreement among tests given at different intervals throughout an 
academic term. 











648 The Journal of Educational Psychology 


To what extent, then, will letter grades differ when one hundred 
items are used instead of one hundred fifty items? By how many 
points on distribution curves will they fail to agree? The boundaries 
of any two letter grades shade imperceptibly into each other but we are 
forced to assign, for example, a "D" grade to a student whose score is
one point less than the scores of students with a grade of "C." Let us
assume a student receives a grade of "C" in the one hundred fifty item
test by the lowest possible score for that grade and that he receives a
grade of "D" in the one hundred item test with the highest score for
the "D" papers; then the two grades are within one point of perfect
agreement. In the tables which follow, wherever failure of agreement 
is denoted, the extent is expressed in terms of units of the range of 
scores. 
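
One reading of this measure of disagreement can be put as follows: a paper's failure of agreement is the number of score points separating its shortened-test score from the nearest score that would have earned the grade received on the longer test. The sketch below follows that reading only; the function name and the score bands are hypothetical and are not taken from the examinations themselves.

```python
def points_from_agreement(short_score, long_grade, short_bands):
    """Points separating a shortened-test score from the band of `long_grade`.

    short_bands: lowest and highest shortened-test score for each letter grade.
    Returns 0 when the two grades already agree.
    """
    low, high = short_bands[long_grade]
    if low <= short_score <= high:
        return 0
    if short_score < low:
        return low - short_score
    return short_score - high


# Hypothetical score bands on a one hundred item test.
bands = {"B": (75, 84), "C": (60, 74), "D": (50, 59)}

# Graded C on the long test, but scoring at the top of the D band on the short one.
print(points_from_agreement(59, "C", bands))
```

A student in the position described in the text, at the top of the D papers on the shortened test, is thus counted as within one point of perfect agreement.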

Examination of Table V lends little or no support to the use of a
one hundred item examination alone as a measure of course grade; but
neither can confidence in the equity of grades be placed upon the
longer test of one hundred fifty items. A general estimate is that for
the three different examinations about twenty per cent of the students
would have had a final examination grade either one step lower or one
step higher in the hierarchy of grades from F to A. It is not possible
to state which grade would be more representative of academic
achievement. If the slightly higher reliability of the longer test
argues in its favor, it must be admitted the evidence loses force
when the correlations of Table IV are recalled. Perhaps the correct
conclusion is that a single test of one hundred fifty items is inadequate
for the determination of a grade in this course. On the other hand, it
appears that judgment hinges upon fine distinctions when grades are
assigned and that the closeness of agreement between the two
classifications should be the point stressed. The reader will undoubtedly
temper interpretations according to the practical demands of his own
academic circumstances.


TABLE V. NUMBER AND PERCENTAGE OF LETTER GRADES CHANGED IN FINAL
EXAMINATIONS WHEN ONE HUNDRED FIFTY ITEMS ARE REDUCED TO
ONE HUNDRED ITEMS

                                      1934        1935 A-B       1935 C-D
                                                   forms          forms
                                    N    Per      N    Per      N    Per
                                         cent          cent          cent
Letter grade unchanged             310   81.1    360   82.9    302   77.3
Letter grade changed because of
variation of:
  One point                         50   13.1     10    2.3     14    3.6
  Two points                        16    4.2     15    3.5     17    4.3
  Three points                       4    1.0     23    5.3     22    5.6
  Four points                        1     .3     11    2.5     16    4.1
  Five points                        1     .3      6    1.4      9    2.3
  Six points                       ...    ...      9    2.1      8    2.0
  Seven points                     ...    ...    ...    ...      3     .8
  Total                            382           434           391

The interchanges of grades occur throughout the series from A to F
but never by more than one step. Students receiving B
grades and D grades in the one hundred fifty item test change positions
more than other groups; the B grades tend downward and the D grades
upward. This is illustrated in Table VI.


TABLE VI. PERCENTAGE OF LETTER GRADES CHANGED WHEN ONE HUNDRED
ITEMS IN FINAL EXAMINATIONS ARE SUBSTITUTED FOR ONE HUNDRED
FIFTY ITEMS

One                              One hundred fifty items
hundred
items     Form            F           D           C           B           A
          1935 C-D                                           4.6       81.[?]
A         1935 A-B                                          10.1       82.2
          1934                                              10.4       91.4
          1935 C-D                               8.1        74.2       18.9
B         1935 A-B                               3.9        69.6       17.8
          1934                                   5.8        70.1        8.6
          1935 C-D                   45.0       89.0        21.2
C         1935 A-B                   25.4       93.5        20.3
          1934                       24.3       88.3        19.4
          1935 C-D   [illegible]     52.5        2.9
D         1935 A-B       19.4        67.5        2.4
          1934           20.0        68.5        5.9
          1935 C-D       71.4         2.5
F         1935 A-B       80.7         7.0
          1934           80.0         7.1

The bottom figure in each cell represents 1934 tests; center number,
1935 Forms A-B, and the top number the 1935 tests, Forms C-D.
Example: In the 1934 test eighty per cent of those who received F on
the one hundred fifty items would have received F on the one hundred
item test. This was true of 80.7 per cent of the 1935 group who took
Form A-B, and of 71.4 per cent of the group who had Form C-D.
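
A table of this kind is a cross-tabulation in which each column, that is, each grade earned on the one hundred fifty items, is reduced to percentages. The sketch below shows the computation for a handful of invented papers; the function name and the data are hypothetical and serve only to make the layout of Table VI concrete.

```python
from collections import Counter


def column_percentages(pairs):
    """Per cent of each long-test grade group receiving each short-test grade.

    pairs: iterable of (grade_on_150_items, grade_on_100_items).
    Returns a dict keyed by the same pairs.
    """
    pairs = list(pairs)
    joint = Counter(pairs)
    column_totals = Counter(long_grade for long_grade, _ in pairs)
    return {
        (long_grade, short_grade): 100.0 * count / column_totals[long_grade]
        for (long_grade, short_grade), count in joint.items()
    }


# Five invented papers graded F on the long test; one rises to D on the short test.
papers = [("F", "F")] * 4 + [("F", "D")]
table = column_percentages(papers)
print(table[("F", "F")], table[("F", "D")])
```

Read in this way, the 80.0 entered for 1934 in the F row and F column means that four in every five students failing the longer examination would also have failed the shorter one.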

Undue emphasis need not be placed upon the contrasts between the
long and short final examinations since the crucial question concerns
course grades as they are affected by combining either a one hundred
fifty item or a one hundred item final with the two pre-final
examinations. Table VII provides the basis for the contention that
considerable time could have been saved without injustice to students by the
use of the shorter final examinations. Over ninety per cent of the
course grades are not changed when the shorter final examination is
substituted for the longer one. It is important to note that where
letter grades change, they are borderline cases due to fluctuations
of from one to three points in average score. These findings emphasize
not only the minor fluctuations resulting from the substitution of a
shorter test but also the arbitrariness shown in assigning
and in discriminating among borderline grades such as high F's and
low D's.


TABLE VII. CHANGES IN COURSE GRADES RESULTING FROM THE SUBSTITUTION
OF ONE HUNDRED ITEM FINAL EXAMINATIONS

                              1934        1935 A-B       1935 C-D
                                           forms          forms
Amount of agreement         N    Per      N    Per      N    Per
                                 cent          cent          cent
[Grade unchanged]          344   90.0    398   91.7    355   90.8
Within one point            30    7.9     21    4.8     15    3.7
Within two points            7    1.8     10    2.3     17    4.3
Within three points          1     .3      5    1.2      4    1.2
Total                      382           434           391























The added labor required to construct and to score one hundred
fifty item examinations cannot be justified upon the grounds of better
discrimination of student achievement. Human judgment is strained
in attempting to rationalize why two points more or less should definitely
separate levels of achievement in distributions where the range is as
great as in these final averages. Within the limits of two points, taken
at various parts of the distribution curves, will be found agreement in
course grades in ninety-nine per cent of the contrasts. The agreement
appears to be neither a statistical artifact nor the product of unreliably
recurring factors. Fluctuations from one year to another and from
one form of examination to another are remarkably slight. Not a
single grade is changed more than one step in the grade series. This
point is emphasized by the following table.


TABLE VIII. CHANGES IN LETTER GRADES IN COURSE WHEN ONE HUNDRED
ITEM FINAL IS SUBSTITUTED FOR ONE HUNDRED FIFTY ITEM FINAL.
GRADE IN COURSE IS AVERAGE OF ONE HUNDRED FIFTY ITEM
FINAL AND TWO ONE-HOUR TESTS

One hundred items           One hundred fifty items and two one-hour tests
and two one-hour
tests     Form            F           D           C           B           A
          1935 C-D                                           3.0        88.9
A         1935 A-B                                           4.7        86.0
          1934                                               2.0        93.0
          1935 C-D                               4.0        91.2    [illegible]
B         1935 A-B                               2.2        89.1    [illegible]
          1934                                   2.5        87.0         7.0
          1935 C-D                   17.0       95.1         5.9
C         1935 A-B                   11.5       94.9         6.3
          1934                       13.0       94.0        11.0
          1935 C-D       19.2        81.4        1.0
D         1935 A-B        0.0        83.6        3.0
          1934            7.0        75.0        3.5
          1935 C-D       80.8         1.7
F         1935 A-B      100.0         4.9
          1934           93.0        12.0




















The bottom figure in each cell represents percentages based on 1934 
examinations: Center number, 1935 Forms A and B, and top number 
1935 examination, Forms C and D. 

Of students receiving certain designated grades, the most stable
are to be found in the average or C grade group and the least stable in
the D grade group. For one form of the examination, namely C-D
form 1935, the tendency for students in the F and D categories to merit
a higher grade is more pronounced than in other forms. But agreement
among the results, with dispersals occurring throughout
the grade scale, is the general rule. Table VIII also summarizes these
changes.


SUMMARY AND CONCLUSIONS 


The number of items constituting a new-type test has been deter- 
mined largely by examination periods originally designed to permit the 
writing of essays. In spite of emphasis upon new-type tests demanding 
a minimum of writing and a maximum of information, educators
generally have accepted the two- or three-hour final examination period 
and then have set out heroically to devise tests long enough to keep 
students busy for the whole time or to make them feel they had met a 
stiff challenge to academic prowess. The resiliency of youth in 
attempting to meet novel situations with knowledge aforethought, the 
necessarily changing emphasis in subject-matter dictated by new 
discoveries, the changing tide of textbooks, and the demand that tests 
measure learning in a broad sense rather than the memorizing of 
specific items, have all operated against the new-type tests fulfilling 
the promise of marked economy. Within recent years, a few educators 
have turned attention to problems intrinsic to the tests themselves and 
pertinent to the question of length. A few studies of reliability have 
appeared. Their authors, in several instances, have suggested either 
eliminating a final examination entirely or curtailing the number of 
shorter tests given periodically. 

Unfortunately, new-type tests must be devised within the frame- 
work of an outmoded examination system. This article is a challenge 
to the perpetuation of the two- or three-hour examination period. 
It is indicated that shorter tests can provide adequate measures of 
academic accomplishment. Perhaps arguments concerning the psy- 
chological or motivational effects of long examinations would carry 
more weight if they reflected interest in measurement more than the 
perpetuation of a system in which the professor felt he had struggled 
valiantly. Protective fears, couched in terms of making a student 
work for his own good, are an inadequate basis for meeting a serious 
problem. 

Nothing in the present study shows essay examinations are less
expensive of an instructor's time; and, since stress is perhaps needed,
nothing bears immediately upon the assumption that new-type tests
are less expensive. Strangely enough, those educators lending strong
support to measurement within education have sometimes settled this
question of economy by resort to verbal magic. A more disciplined
logic would require proof. We need to try out both new-type and
essay tests under conditions where large classes prevail, since large
classes have offered impetus to the development of new-type tests.
And in making these comparisons we must avail ourselves of inquiries
into matters of validity and of reliability of both kinds of examinations.

The following brief paragraphs summarize the major points of this 
article: 

1. The plausibility of reducing the length of new-type examinations 
to decrease the many hours required for their construction and scoring 
was the problem under investigation. To prepare and score two one 
hundred fifty item new-type tests, given to about eight hundred fifty 
students, took a total of two hundred and nine hours and twenty 
minutes. 

2. By the elimination of every third item, it was possible to study 
any changes in reliability and validity which might follow the reduc- 
tion of the number of items from one hundred fifty to one hundred. 
The reliability coefficients, as expected, dropped when the shortened
test was substituted: from a range of .87 to .93 for the one hundred
fifty item tests to a range of .77 to .86 for the shortened tests.

3. When a one hundred fifty item test is given, it does not mean 
that one hundred fifty questions are actually helping in the dis- 
crimination of students with marked grasp of the subject from those 
with little knowledge of the subject. Over thirty per cent of the 
single choice questions were useless, whereas only about thirteen per 
cent of the single word completion questions lacked discriminative 
value. A detailed study of the validity of various new-type questions 
to be published soon confirms these contrasts. 

4. The slight decrease in reliability due to shortening the one 
hundred fifty item examination was not significant enough to cause 
any important change in validity as ascertained by correlating final 
examination scores with the average score in two one-hour tests. 

5. From seventy-seven to eighty-three per cent of the letter grades
in the original final examination were unchanged when the shortened
test was substituted. Where changes in letter grades resulted, these
changes were due to variations of from one to seven points at the
boundaries arbitrarily separating one grade from another on distribution
curves. No letter grade changes were greater than one step
displacement. The B and D grade students showed the greatest amount of
fluctuation. When grades of B change they are more apt to be lowered,
whereas grades of D are more likely to be raised.

6. When the shortened final examination was substituted for the 
longer test and the grade for the course was computed on the basis 
of two one-hour tests and the final examination, over ninety per cent 
of the grades remained unchanged. The few variations which did 
occur were due to minor changes of from one to three points at the 
boundary line of grade determination. 

7. The C grades are the most stable with the shortening of the 
final examinations, the F grades rank next in stability, and the D grades 
show the least stability. 

8. These data, we believe, question the necessity of giving long 
examinations to measure adequately the knowledge of the student. 
From the student’s point of view, little can be said in favor of long 
tests. Most grades remain constant after two one-hour tests and the 
few grades which change are due to slight fluctuations and, therefore, 
are borderline cases. From the standpoint of the teacher, time and 
effort can be saved with little loss in the reliability and no apparent 
decrease in the validity of the tests.


AN EXPERIMENTAL STUDY OF THE IMPROVEMENT
IN READING BY COLLEGE STUDENTS¹


ALVHH R. LAUER 
Associate Professor of Psychology, Iowa State College, Ames, Iowa 
STATEMENT OF THE PROBLEM 


Ability to read and comprehend the printed page is one of the most 
important skills in modern life. In college the student must depend 
upon books for a large part of his education. After getting through 
college he must rely almost entirely upon reading for his advancement 
in theoretical matters. To keep abreast of affairs, professional, 
cultural, political or otherwise, requires the highest efficiency in reading. 

The object of the present study was to set up an experimental 
procedure which might be used by the average student to increase his 
reading ability. While comprehension was stressed in the mimeo- 
graphed form used as a guide, no attempt was made to measure it 
directly. It is quite generally agreed that speed in reading is positively
correlated with comprehension. Practically all of the cases
studied reported that their comprehension had increased, although this 
fact was not submitted to objective verification. 

The following questions were set for experimental investigation: 


1. To what extent and in what way will students improve their reading 
habits under ordinary conditions of study? 


2. Are there sex differences in reading rate and amount of improvement 
under the conditions of this study? 


3. In what way is the rate of reading related to intelligence, general 
culture, and academic success? 


4. Do students improve their reading speed while in college? 


5. What type of material used in college is read most rapidly and what 
differences exist? Are there divisional differences? 


6. In what kinds of material is improvement likely to be most rapid? 


Starch!’ early predicted that the average student could improve 
his reading fifty to one hundred per cent if he were so inclined. Book? 
reports increases as great as one hundred two per cent. Other investigators 
have secured marked improvement in reading under regular 
experimental conditions. 





¹ Read before Section Q of A.A.A.S. at Pittsburgh, December 28, 1934. 

It is not the purpose of this paper to review the experimental 
literature. It suffices merely to mention contributions of Ahrens, 
Delabarre, Erdman and Dodge, Javal, Landolt, Lamare, and Volkmann 
and Lamansky; after which the pioneer work of Huey® served as a basis 
for the improved techniques of Buswell, Gates, Gray, Judd and others. 
The average college student has very little opportunity of profiting 
by these excellent researches. The function of this study, in part, was 
to devise a means for utilizing the results already obtained in the field 
of reading and of placing the more salient facts necessary to improvement 
in the hands of students who wished to improve their reading 
ability. 


METHOD AND PROCEDURE 


The study was conducted as a part of the course in educational 
psychology for a group made up largely of sophomores, juniors, seniors 
and graduate students. A few freshman records were obtained. Most 
of the records were collected during the years 1931, 1932, 1933, and 
1934. About one-sixth of the student’s grade depended upon the 
completion of some experimental report. The students were told 
that the amount of improvement they made would not influence the 
grade they received. 

At the beginning of the study several classes were required to do the 
experiment as a term report. Later the improvement of reading was 
made optional and only those directly interested in self-improvement 
of this type were used as subjects. Improvement increased from thirty 
per cent to thirty-five per cent with this change in subjects. More than 
four hundred records were made, but only three hundred sixty-seven of 
these were used in the present analysis of results. Intelligence ratings 
were not available for all and the correlations for the most part were 
computed from three hundred twenty cases. Summations were made 
from a total of three hundred fifty-five cases, two hundred twenty-four 
women and one hundred thirty-one men. 

Each student was given a six-page mimeographed form on improvement 
of reading. This form consisted essentially of the following 
parts: (1) a preliminary discussion of the possibilities of reading 
improvement, (2) a description of types of reading, (3) a method for 
calculating the number of words to a page and of timing the reading 
period, (4) some precautions as to sources of error in experimental 
technique, (5) fifteen general principles for improving reading, and (6) a 
form for tabulating results with instructions for making calculations 
and tabulating the data. 

Twenty practices were set as the improvement period. They were 
to be made every day or every other day. Only a few persons varied 
from this number of practices. The improvement was calculated by 
taking an average of the first three trials and subtracting this mean 
from the mean of the final three trials. For the intercorrelations, the 
absolute improvement in words a minute was used to avoid the possi- 
bility of spurious correlations when using a percentage score. For 
computing the percentage improvement in the summations, the number 
of words a minute improvement was divided by the mean of the first 
three trials. 
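
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic stated above: the function name and the twenty trial values are hypothetical and are not taken from the study's records.

```python
from statistics import mean


def reading_improvement(rates_wpm):
    """Return (absolute gain in words a minute, percentage gain).

    The initial rate is the mean of the first three trials and the
    final rate is the mean of the last three trials.
    """
    if len(rates_wpm) < 6:
        raise ValueError("need at least six trials so the two sets of three do not overlap")
    initial = mean(rates_wpm[:3])
    final = mean(rates_wpm[-3:])
    gain = final - initial                 # score used for the intercorrelations
    percent = 100.0 * gain / initial       # score used for the summations
    return gain, percent


# Hypothetical record of twenty practices (words a minute), for illustration only.
trials = [200, 210, 190] + [230] * 14 + [260, 270, 280]
gain, pct = reading_improvement(trials)
print(gain, pct)
```

Keeping the two scores separate mirrors the text: the absolute gain goes into the correlations, and the percentage appears only in the summary tables.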

The readings were done under ordinary conditions of study and the 
reading material consisted of the regular assignments in two or more 
courses. Tests were made for five or ten minutes of the study period. 
While there was some variation, most of the practices were made on 
three days of the week, as there are more three-hour courses given at 
Iowa State College. It was suggested that the timing periods be made 
alternately at the beginning and the end of study periods, but this could 
not be checked, and it is doubtful that it was carried out. 

There were variables which could not be controlled under the condi- 
tions described, but the following reliability coefficients establish the 
results as sufficiently accurate for the purposes used. Since different 
types of study material were used by each student the reliability is 
probably higher than the coefficients indicate. The results of two 
texts used by each student were correlated and, when corrected by the 
Spearman-Brown formula, yielded the following reliability coefficients: 


Reading rate at beginning of the experiment R = +.86 
Improvement in reading.................. R = +.82 
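
For readers unfamiliar with the correction, the Spearman-Brown formula can be sketched as follows. The paper reports only the corrected coefficients; the raw correlations between the two texts are not given, so the values printed here are a back-calculation that assumes the usual doubling case (k = 2), and the function names are illustrative.

```python
def spearman_brown(r_part, k=2.0):
    """Reliability predicted for a measure lengthened k times."""
    return k * r_part / (1.0 + (k - 1.0) * r_part)


def implied_raw_correlation(r_full, k=2.0):
    """Invert the formula: the part correlation implied by a corrected coefficient."""
    return r_full / (k - (k - 1.0) * r_full)


# Corrected coefficients as reported in the text above.
for label, r_full in [("beginning rate", 0.86), ("improvement", 0.82)]:
    raw = implied_raw_correlation(r_full)
    print(label, round(raw, 3), round(spearman_brown(raw), 2))
```

Under that assumption the uncorrected correlations between the two texts would have been roughly .75 and .69.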


RESULTS 


The data were all reviewed, irregular records thrown out, calcula- 
tions of improvement were rechecked for accuracy, and the data 
punched on Hollerith cards from which the analysis was made. The 
following results were obtained. 

1. Ranges in initial rate of from seventy-three to ten hundred 
thirteen words a minute were recorded. The mean of all subjects was 
two hundred forty-eight words a minute. Rates of six hundred to 
seven hundred words were not uncommon. 

2. Improvement ranged from zero to two hundred forty-nine per 
cent. The mean was 35.3 per cent. Improvement over one hundred 
per cent was quite frequent. 

3. Beginning reading rate with grade point average: +.2452 (N = 83) 

4. Improvement with cultural knowledge: +.0778 (N = 51) 

5. Beginning reading rate with cultural knowledge: +.4997 (N = 51) 


TABLE I. INTERCORRELATIONS (N = 320) 

          Br         Im         Fr 
A       +.2051     +.1179     +.2179 
Fr      +.6701     +.6992 
Im      +.2598 

A = Aptitude. College entrance test mental alertness. 

Br = Beginning reading rate. Average of first three trials. 

Fr = Final reading rate. Average of last three trials. 

Im = Improvement in terms of words a minute. Difference between Br and Fr. 


Cultural knowledge was measured by a test developed at Iowa State 
College using one hundred fifty items covering the fine arts. The test 
has a reliability of +.92. This test covers knowledge of subjects like 
music, drama, literature, poetry, painting, architecture, and sculpture. 

Further analyses of the data were made to ascertain the differences in 
reading ability by sex, by classification in college, by divisions, and by 
subjects. The results are shown in Tables II, III, IV and V. Statistical 
evaluation was not made of the differences noted, but consistency 
in the tendency of different reading material indicates a fair 
reliability of the values obtained. 


TABLE II. COMPARISON OF SEXES 

                                                             Average of A and B 
                     Text A means        Text B means        means, weighted 
          Number   Initial  Per cent   Initial  Per cent   Initial  Per cent 
                   rate     increase   rate     increase   rate     increase 
Total       355 
Women       224    255.7    37.0       255.3    39.0       255.5    38 
Men         131    246.1    32.4       244.9    30.1       245.5    3… 

A composite curve of the improvement of all students is shown in 
Fig. I. There seems to be no evidence of a plateau, and it may be 
assumed that the reading function had not reached a maximum in 
twenty days of practice. 


TABLE III. COMPARISON OF CLASSES 

                                                                      Average of A and B 
                              Text A means        Text B means        means, weighted 
                   Number   Initial  Per cent   Initial  Per cent   Initial  Per cent 
                            rate     increase   rate     increase   rate     increase 
Total                346 
Non-collegiate         1    225.7    37.0       215.0    24.0       220.4    30.5 
Freshman               5    272.7    35.8       272.9    36.2       272.8    36.0 
Sophomore             72    254.3    30.7       253.0    33.2       253.6    32.0 
Junior               172    254.8    35.2       251.0    36.7       252.9    36.0 
Senior                64    242.8    36.0       244.9    37.3       243.8    36.6 
Graduate              31    258.8    43.0       267.8    42.2       263.3    42.6 
[label illegible]      1    214.7    50.0       255.0    16.0       234.8    33.0 

TABLE IV. COMPARISON OF DIVISIONS 

                                                                        Average of A and B 
                                Text A means        Text B means        means, weighted 
                     Number   Initial  Per cent   Initial  Per cent   Initial  Per cent 
                              rate     increase   rate     increase   rate     increase 
Total                  341 
Home Economics         192    254.4    35.9       252.1    39.2       253.2    37.6 
Agriculture             70    237.7    29.8       238.3    29.5       238.0    29.6 
Industrial Science      59    262.2    37.9       266.9    37.2       264.6    37.6 
Engineering             19    248.6    41.3       231.1    35.9       239.8    38.6 
Veterinary Medicine      1    150.7    18.0       149.3    34.0       150.0    26.0 


























It would have been interesting to have measured the reading rate 
before and after the practice period to ascertain whether the actual 
technique of silent reading had improved. It must be remembered 
that the present study was designed to measure reading under study 
conditions and if the results are as consistent as the reliability indices might 
warrant they may be more significant than short reading periods. 
They represent reading under actual study conditions. The following 
conclusions are offered in full recognition of the limitations of the study 
as obtained from results secured under the conditions described in the 
experiment. 

TABLE V. COMPARISON OF SUBJECT-MATTER 

                                                                              Average of A and B 
                        Text A means                Text B means              means, weighted 
                  Number  Initial  Per cent   Number  Initial  Per cent   Initial  Per cent 
                          rate     increase           rate     increase   rate     increase 
Total               353 
Science             288   251.8    35.3         196   243.9    36.0       248.6    35.6 
History              13   251.4    42.9          25   242.2    31.8       245.3    35.6 
Social Science       43   248.1    35.5          86   258.5    34.4       255.0    34.8 
Literary              8   280.3    40.4          45   278.4    44.9       278.7    42.7 
Miscellaneous         1   282.3    14.0 

Note: The number differs since in some categories a few could not be classified. 

[Fig. I. Composite curve of improvement. Label recovered from the figure: Average for Academic Subjects] 
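
The word "weighted" in the last columns of Table V can be checked against the printed figures. The sketch below assumes the average is weighted by the number of records for each text, with the numbers read from the table as printed; the helper name is illustrative. On that assumption the four subject-matter rows are reproduced to one decimal place.

```python
def weighted_mean(values, weights):
    """Mean of `values`, each counted `weights[i]` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# (Text A number, Text A initial rate, Text B number, Text B initial rate, printed average)
table_v = {
    "Science":        (288, 251.8, 196, 243.9, 248.6),
    "History":        (13, 251.4, 25, 242.2, 245.3),
    "Social Science": (43, 248.1, 86, 258.5, 255.0),
    "Literary":       (8, 280.3, 45, 278.4, 278.7),
}

for subject, (n_a, rate_a, n_b, rate_b, printed) in table_v.items():
    computed = round(weighted_mean([rate_a, rate_b], [n_a, n_b]), 1)
    print(subject, computed, printed)
```

The same check applied to the per cent increase columns gives 35.6 for both Science and History, in agreement with the table.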
SUMMARY AND CONCLUSIONS 


In general the conclusions offered here will be given in relation 
to the questions proposed at the first part of this paper, as follows: 

1. Students improved their reading rate on the average about 
thirty-five per cent over a period of twenty days under the conditions of 
this study. 

2. Women read somewhat faster and also improved somewhat more 
than men. The differences are consistent but rather small. 

3. The highest correlation found was between reading rate and cul- 
tural knowledge. An inspection of the scattergram showed practically 
no slow readers who ranked high in cultural knowledge, but a number 
who read rapidly fell below expectations on cultural information. The 
correlation between reading rate and intelligence was found to be 
+.205. 

4. The study indicates that students at Iowa State College do not 
increase their reading rate while in college unless some regular remedial 
program is carried out. There is some evidence that they read pro- 
gressively slower between the freshman and senior years, since it is 
assumed that the selective factors operate to eliminate some of the poor 
readers from college. 

5. The relative speeds of reading from highest to lowest were as 
follows: Literature, social sciences, history, and science. The differ- 
ences were relatively small, in no case exceeding thirteen per cent of the 
slowest read material. Students in agriculture were found to be the 
slowest readers. Students majoring in the more theoretical sciences 
were superior to students in the strictly applied sciences. 

6. Greatest improvement was found in literature and non-technical 
reading, although differences were not marked. 

7. Those who read more rapidly at the beginning generally 
improved most in terms of percentage improvement. In general, it 
seems advisable to provide remedial reading for all students in college. 
It is assumed that the permanence of improvement is a function of the 
amount of overlearning, spacing of practice periods, and other variables 
known to affect learning in a general way. 

8. In general it suffices to say that students can improve their 
reading ability by self-administered methods and the improvement of 
mature students may be even greater than that of younger students. 
Curves of improvement constructed from successive practices indicate 
certain characteristics of learning to read more rapidly. 

STUDIES OF EYE-MUSCLE IMBALANCE AND POOR 
FUSION IN READING DISABILITY: AN EVALUATION 


PAUL A. WITTY AND DAVID KOPEL 


Northwestern University 


Certain misconceptions concerning the potency of eye-muscle 
imbalance or heterophoria in causing educational maladjustment 
have gained currency in the pages of ophthalmological and medical, 
as well as in educational, journals. This ocular condition has been 
described in the following manner: 


When the eyes deviate, the retinal images no longer fall on corresponding 
points of the two retinae, and the resulting mental picture is blurred and con- 
fused. Double vision or lack of fusion may result and the visual image 
caused by the stimulation of one eye, may be imperfectly superimposed on 
that of the other... . 


The superimposition of the images of words and letters that often occurs 
in exophoria creates mental impressions of a composite word or letter form, 
which may be quite unfamiliar, or may blend into a familiar looking symbol. 


This concept led students in educational research to seek the cause 
of poor reading attainment in impaired eye-muscle conditions. 


EARLY INVESTIGATIONS 


Eames,’ in 1931, reported that a group of sixty-four poor readers, 
compared with a class of eighty-seven unselected school children, 
exhibited a markedly higher incidence of exophoria* (especially at the 
reading distance); moreover, the amplitude of fusion convergence of 
the poor readers was only one-half of that displayed by the unselected 
group. Eye-muscle coördination was measured by careful use of the 
Stevens phorometer, the Risley rotary prisms and Maddox rods, 
double prisms, and the Wells-DeZeng phoropter. Another similar 
comparison of larger groups, published the following year by Eames,® 
yielded similar differences; these differences were considered statistically 
reliable and “very significant” in causing reading disability. 
Eames failed to report the method employed in selecting all of his 
disability cases. Some, however, were chosen from his private 
ophthalmological practice. It is not unlikely that his poor readers 
constitute a selected group, sent to him partly or largely because 
vision was faulty. 

* Exophoria, the more common type of lateral muscle imbalance, denotes a 
tendency of the visual lines to deviate outward from the parallel condition when 
the eyes are at rest. Esophoria refers to the tendency of the visual lines to deviate 
inward; this condition produces results similar to those of exophoria. Vertical 
muscle imbalance occurs rarely. 

Selzer¹¹ in 1933 constructed stereoscopic tests for studying muscle 
balance and fusion, and examined thirty-three poor readers and one 
hundred unselected children. Heterophoria was displayed by ninety 
per cent of the disability group; only nine of the other children evinced 
similar defects. Selzer concluded that “conditions of muscle imbalance 
and alternating of vision, in addition to a lack of fusion, . . . 
account for such reading disability as are not accounted for by general 
mental disability. The lack of visual fusion is due to muscle imbalance 
that has existed from birth or early infancy.” Selzer has apparently 
not continued his research since this inconclusive and vague report was 
published. It should be noted that the poor readers who displayed 
heterophoria upon Selzer’s laboratory tests were considered by oculists 
to exhibit “imbalance within the normal range for children of that 
age.” These specialists, reports Selzer, stated that “the imbalance 
was latent only and did not contribute to the reading disability.” 

Apparently influenced by these studies (and by the Wells stereoscopic 
technique’*) Betts¹ in 1934 developed a series of stereoscopic 
slides to appraise the coördinate action of the eyes. These slides, 
known as the “Betts Tests of Visual Sensation and Perception,” are 
said to permit the hitherto impossible “scientific study . . . of the 
binocular coördination required in reading.” After studying the 
visual characteristics of an unstated number of “disabled readers” 
Betts¹ announced (like Selzer) that ninety per cent displayed eye 
muscle imbalance (and astigmatism); he concluded that, “Many of 
our reading problems are directly traceable to a lack of coördination 
between the two eyes and to the probable failure of the mind to combine 
the right-eye and left-eye pictures for correct interpretation.” In 
evaluating this study it should be observed that Betts neglected not 
only to state the number of cases in his disability group, but he failed 
also to employ a control group. It is important, moreover, that in 
recent articles his emphasis has been modified: heterophoria has 
changed from a “cause” to a “correlate” of reading disability.‘ His 
recent book on reading, which includes a large section devoted to visual 
conditions, contains no reference to recent research results cited below.* 

Of interest in this chronological succession of ocular researches 
is the unanimity in findings, and the overemphasis (as it now appears 
in the interpretation of the investigations just cited) upon the 
etiological significance of single physiological factors in disturbing the 
development of the reading function. Noteworthy, therefore, is a 
recent series of investigations concerned with problems basically 
similar to those in the studies already reported. This later work, 
however, has yielded results strikingly different from those of Eames, 
Selzer, and Betts. 


LATER STUDIES 


First perhaps in this later series is the extensive investigation of 
Farris,* which was conducted in 1934. Visual, educational and mental 
data were obtained from one thousand six hundred eighty-five seventh-grade 
pupils of the Oakland public schools. Visual examinations, 
including tests of acuity, astigmatism, accommodation-convergence, 
fusion, and muscle balance, were made by specialists from the University 
of California Division of Optometry. The Department of 
Research of the Oakland Public Schools secured achievement and 
intelligence test scores. Children with defective vision were matched 
with a control group having approximately equivalent chronological 
and mental ages.* Farris compared the reading scores of the groups 
and concluded that “Types of eye defects other than the myopic, 
hyperopic and the strabismic types have little effect upon progress in 
reading.”† Stated more directly, irregularities in fusion and muscle 
balance were found to be unassociated with reading achievement.‡ 





* The number in these groups is not reported in the abstract of Farris’ dissertation. 

† Farris* reports that “Both hyperopia and strabismus are associated with less 
than normal progress in reading; while myopia and myopic astigmatism were both 
found to be associated with more than normal progress.” The first-named defects 
are probably congenital; i.e., they are found in children of pre-school age and their 
incidence becomes less in children of school age. On the other hand, myopic 
errors occur infrequently before school entrance, and develop at an increasing rate 
as children pass through the grades.¹⁰ Thus we can understand why children 
who make “more than normal” school progress should be characterized by a supernormal 
incidence of myopia: These children read better by virtue of their diligence 
or interest or mental maturity and have doubtless turned more often than their 
fellows (showing normal progress) to reading activities. Despite the visual 
handicap, the child’s ability (at this and higher levels) to compensate physiologically 
for the visual error (at the cost of fatigue) enables him to demonstrate and maintain 
a reading superiority. 

‡ The abstract of Farris’ study contains the following somewhat ambiguous 
statement: “Pupils whose visual perception is monocular make progress in reading 
superior to those not having correct coördination of the two eyes.” The former, 
it should be noted, are a small group of children having “monocular” vision, i.e., 
vision in only one eye, a condition which avoids all possible difficulties of fusion and 
coördination, and is therefore favorable (when refraction is normal) for reading 
development. Indeed, Betts² states: “If we were a one-eyed race our reading 
difficulties would be few.” 

666 The Journal of Educational Psychology 

TABLE I. A COMPARISON OF THE OCULAR BEHAVIOR OF “NORMAL” INDIVIDUALS 
AND THOSE HAVING HIGH BINOCULAR IMBALANCES 
(Adapted from Clark’s Study‘) 

                                                    Experimental     Control          D/σ 
                                                    group            group            diff. 
1. Average number of fixations per line             13.81 ± 1.60     13.59 ± 1.59     0.25 
2. Average number of regressions per line            1.00 ± 0.63      0.85 ± 0.50     0.46 
3. Average number of “initial regressions” …         1.15 ± 0.31      1.04 ± 0.29     0.65 
4. Average duration of regressions in 45 …           5.24 ± 0.88      5.42 ± 0.87     0.37 
5. Time prior to initial forward movement in …      12.38 ± 0.96     12.08 ± 1.35     0.40 
6. Average reading time per line in seconds          3.35 ± 0.59      3.74 ± 0.59     1.00 
7. Duration of divergence movements in 45 …          2.41 ± 0.35      2.34 ± 0.55     0.28 
8. Extent of divergence movements in minutes        39.1 ± 10.7′     30.0 ± 9.5′      1.70 

It has been remarked that from an a priori standpoint, “anomalies 
of binocular balance should cause definite irregularities of the eye 
movements and be a significant factor from the point of view of 
remedial reading.”⁵ In his experimental investigation of this problem 
in 1935, Clark® therefore employed eye movement photography to 
determine differences in the binocular behavior of the eyes of “normal” 
individuals and of those having marked exophoria. His subjects were 
selected from a group of one hundred ninety-one college freshmen 
at the University of Southern California whose muscle balance was 
measured with two three-diopter displacing prisms. The experimental 
group was the upper decile of the students displaying muscle imbalance; 
it consisted of eleven students having a near point exophoria ranging 
from twelve to sixteen diopters. Matched with this group on the 
basis of sex and score in reading comprehension and linguistic ability 
were eleven subjects in the lower decile of the distribution; these 
students displayed approximately normal binocular balance, zero to 
two diopters of exophoria. 

In a carefully controlled procedure both groups were presented with 
two paragraphs of reading material (from the field of psychology). 
Eye movements, during reading, were photographed with an instrument 
yielding simultaneous records of the horizontal and vertical 
movements of both eyes. Quantitative comparisons of the control and 
exophoric groups are reproduced in Table I. 

To be noted in all the comparisons of regression duration and 
frequency, number of fixations, reading time, and so forth (items 
“1-7” of Table I) is the minuteness and the statistical insignificance 
of the differences between the groups. * 
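
The "D/σ diff." column of Table I is a critical ratio: the difference between two group means divided by the standard error of that difference. A minimal sketch follows. It uses the textbook form for two independent groups, which is an assumption, since the source does not say how Clark computed the column or what the ± values in the table represent; the input figures are invented for illustration.

```python
import math


def critical_ratio(mean_a, se_a, mean_b, se_b):
    """Absolute difference of two means over the standard error of the difference.

    Assumes independent groups, so the standard error of the difference
    is the square root of the sum of the squared standard errors.
    """
    sigma_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) / sigma_diff


# Invented values, not taken from Table I.
ratio = critical_ratio(10.0, 0.3, 9.0, 0.4)
print(round(ratio, 2))
```

In the statistical practice of the period a ratio of about 3 was commonly required before a difference was called reliable; every entry in the table falls well below that level.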

The greater, but still statistically insignificant difference in item 
“8” reflects the somewhat larger convergence and divergence movements 
made by the exophoric individuals at the beginning of each line. 
These movements are of some importance since they may cause 
fatigue.† The factor of fatigue, however, could not be evaluated since 
it was precluded by the short period of time (fifteen minutes) used 
in administering the tests. Clark points out that, according to Eames, 
exophoria often results in momentary overlapping and confusion of 
letters and words in reading. It appears significant, therefore, that 
none of Clark’s subjects reported “any doubling of the print in spite 
of the fact that divergence movements amounting to as much as three 
degrees were made at the beginning of the line, and the time required 
to complete these movements was as much as one-fifth second . . . 
the subjects made no ocular movements which could possibly cause 
diplopia with the possible exception of those at the beginning of the 
line, and none were reported there in spite of the large divergence 
movements. However, it is entirely possible that such movements 
causing diplopia would occur at the end of long reading periods or when 
fatigue is present.” Clark does not signify the possibility of such 
movements developing in “normal” as well as in exophoric individuals 
who, after varying amounts of reading, become fatigued. 

* These findings are corroborated by Swanson and Tiffin¹² (in the study 
reported in greater detail below). These investigators photographed the eye 
movements of groups of readers who displayed normal and poor fusion at reading 
distance. In mean number of eye-fixations and in average duration of eye-fixation 
these groups were practically identical. 

† Clark® concludes that “any excessive reading fatigue which may result from a 
condition of exophoria is certainly not a muscular fatigue . . . [but] is due 
definitely to sensory processes in certain respects similar to flicker fatigue.” 

Another very detailed study of the rôle of anomalies and imperfections 
in vision and their relationship to reading was made by Swanson 
and Tiffin¹² at the University of Iowa. Their subjects were college 
freshmen classed, on the basis of their differential performance on 
reading tests, as poor (lowest decile), good (highest decile), and 
unselected readers. All subjects were examined with the Betts tests of 
Visual Sensation and Perception, and measures were obtained of the 
following functions: Visual acuity, astigmatism, depth perception, 
lateral muscle balance, vertical muscle balance, and fusion at far and 
at reading distances. Summarized in Table II below are the data 
most relevant to the present discussion. These exhibit clearly “the 
tendency toward uniformity of response in fusion [and muscle balance] 
among the groups. . . . ” 


TABLE II. SALIENT DATA FROM SWANSON AND TIFFIN’S STUDY¹² SHOWING 
INCIDENCE OF MUSCLE IMBALANCE AND POOR FUSION IN VARIOUS GROUPS 

                                                Unselected    Poor          Good 
Visual defect categories                        group,*       readers,*     readers,* 
                                                per cent      per cent      per cent 
Vertical muscle imbalance                           3             3             0 
Lateral muscle imbalance (reading distance)        21            15            22 
Inadequate fusion (reading distance): 
    [label illegible]                              19            19            17 
    [label illegible]                               7            17            13 

* Customary refractive corrections were worn by individuals; data based upon 
examinations without corrections yielded similar results. 


Analysis of the responses to each and to all of the visual tests led 
Swanson and Tiffin to the following conclusion: “The fact that the 
uncorrected eyes of the good reader are poorer in general, except in the 
case of astigmatism, than the uncorrected eyes of the poor reader, and 
the fact that the corrected eyes of both groups are approximately the 
same, make it seem improbable that differences in visual efficiency are 
causally related to differences in reading ability among college students. 
The above conclusions are supported by an approach through statistical 
correlations in which it was found that Betts’ tests are not 
appreciably related to reading ability. This statement is equally 
true whether intelligence is left uncontrolled or whether it is held 
constant by means of partial correlation.” 
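
Holding intelligence constant "by means of partial correlation" refers to the first-order partial correlation coefficient. The sketch below shows the standard formula; the three input correlations are invented for illustration, since the source quotes no coefficients from Swanson and Tiffin, and the function and variable names are likewise illustrative.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x with y after z has been held constant.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("z is perfectly correlated with x or y; the partial is undefined")
    return (r_xy - r_xz * r_yz) / denom


# Invented values: x = a visual test score, y = reading ability, z = intelligence.
r_visual_reading = 0.10
r_visual_intelligence = 0.05
r_reading_intelligence = 0.50
partial = partial_correlation(r_visual_reading, r_visual_intelligence, r_reading_intelligence)
print(round(partial, 3))
```

When the visual measure is nearly unrelated to intelligence, as in these invented figures, the partial coefficient differs little from the raw one, which is the pattern the quoted conclusion describes.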

Fendrick® has recently reported a rather suggestive study of the 
“Visual Characteristics of Poor Readers.” Sixty-four pairs of poor 
and good readers in grades two and three of several New York City 
schools were tested for muscle balance with the Maddox Rod Test and 
the Betts apparatus; additional data were secured by optometrists 
from the Division of Optometry, Columbia University. The results 
of all tests and procedures revealed the same general trend: A lack of 
relationship between ocular anomalies or deficiencies and degree of 
reading disability (except when teaching methods rely preponderantly 
upon visual techniques). “Measures of lateral eye-muscle coördination 
did not yield any evidence that reading disability cases manifested 
a more pronounced aberrance in muscle-imbalance than the 
control cases. The reliability of this finding was established through 
three distinct approaches which consistently failed to produce any 
significant variation in the group comparisons.” 

The writers of this paper, also, have investigated the relationship 
between visual defect and reading disability.¹⁴, ¹⁵ Their experimental 
group consisted of the one hundred poorest readers in grades three to 
six, inclusive, of the Evanston Public Schools (District 75). A control 
group was selected in proportionate sex ratio and in proportionate 
numbers from the same grades and schools. All children were exam- 
ined with the Betts visual tests; and the responses were tabulated to 
yield comparative data showing the incidence of various visual defects 
in the two groups. In addition, a new technique was employed: The 
probable effect of visual status upon reading achievement was studied 
by comparing the average reading attainment of the poor and good 
readers classified in each visual defect category. 

A detailed account of the data has been given elsewhere.!® Typical 
of the presentation was the finding that eighteen per cent of the poor 
readers and twenty per cent of the good readers displayed no fusion; 
of the poor group, only eight per cent exhibited lateral muscle imbal- 
ance, of the good group, fifteen per cent. And the effect of these 
(and other) visual aberrations upon reading achievement in poor 
and good readers was found to be negligible—mensurable upon 
standardized tests only in deviations of hundredths and thousandths 
of a grade. 

Analysis of the data led to the conclusion that “‘the poor readers 
are not characterized by a greater incidence of visual defects and 
anomalies than are good readers. With the exception of the slow 
fusion group the percentages among the non-problem are somewhat 
higher than among the problem children in practically all other visual 
(defect) categories studied. Furthermore, the various visual factors— 











bs 
} 





— 





670 The Journal of Educational Psychology 


slow fusion, no fusion, lateral muscle imbalance, deficient acuity, and 
ametropia, singly and in combination—appear unrelated to reading 
deficiency.”'> This conclusion seems to be corroborated by the entire 
array of recent studies (conducted at several age levels with a variety 
of techniques and carefully controlled procedures). As an hypothesis 
upon which to proceed in other studies, it deserves consideration; 
moreover, it should serve to modify the practices of educational investi- 
gators who have constructed diagnostic procedures and remedial 
techniques the basis of which are the apparently unverified claims of a 
small group of research workers who have emphasized the rôle which
certain visual defects allegedly play as single factors in the causation 
of reading disability (Cf. Witty and Kopel¹⁴).

Nothing in this conclusion should lead one to minimize the impor- 
tance of good vision for optimum physical efficiency and achievement 
of both poor and good readers. Moreover, visual defects may, in 
individual cases, seriously impede the reading process or contribute to 
its disfunction. It is therefore highly desirable that each child, upon 
entrance to school and at regular intervals thereafter, should receive 
thorough ophthalmological study. As a corollary, such examination 
should be an essential procedure in the diagnosis of every case of 
educational disability. 

TEACHERS’ JUDGMENTS OF PUPIL ADJUSTMENT


H. MAX HOUTCHENS 


Iowa Child Welfare Research Station, State University of Iowa 


Although it has been the aim of the public school to realize its 
potentialities as an agency of character education, it has not reached 
that goal due to a lack of staff informed in character and personality 
development. Such a lack seems to be clearly indicated through sev- 
eral research studies. Among the most noteworthy are the well- 
known findings of Wickman, who was able to show by measuring the 
customary attitudes of teachers toward behavior problems in children 
that teachers consistently regard as serious those forms of behavior 
which constitute an attack on established order, or a frustration of the 
immediate purpose of teaching. Further, the withdrawing or submis- 
sive forms of behavior were rated consistently low in importance. 
These judgments were practically reversed by mental hygienists. 

Laycock,® in a study of teachers’ reactions to maladjustment of 
school children, found results that were in exact agreement with Wick- 
man. Others who obtained very similar data are Yourman,'! Mac- 
Clenathan,'* McClure,’ and Boynton and McGaw.? 

It would seem unquestionably evident, then, that teachers identify 
as problems those children whose behavior is aggressive and disturbing, 
and fail to recognize as problems those children whose behavior is of a 
withdrawing, evasive sort. 

This study aims to present evidence in line with the above findings, 
with the application of a technique for measuring the adjustment of 
children whom teachers have rated. The problem was formulated 
during the process of selecting a control group of normal, well-adjusted 
children for matching purposes. 


METHODOLOGY AND PROCEDURE 


Twelve teachers in a large junior high school* were asked to submit 
the names of the two or three boys whom they considered the best 
adjusted in their classroom. They were also requested to select for the
investigator the two or three boys whom they considered the least 
adjusted. The investigator discussed with each teacher the general 
aspects of adjustment from a mental hygiene standpoint. Specific 





* Acknowledgment is gratefully made to the Des Moines City School supervisors
and staff for the excellent coöperation and aid given toward this study.
criteria of selection were left to the teachers’ judgments and each
teacher submitted two or three names for each group. Using this 
method it was possible to acquire a group of twenty-eight children 
between the ages of twelve and sixteen who were rated as the best 
adjusted and a group of forty children of the same age-group who were 
considered as poorly adjusted. 

For a control group the investigator selected at random a group 
of thirty boys who represented all of the rooms involved in the experi- 
mental groups. These random selections were then submitted to their 
respective teachers for assurance ratings as to whether any of these 
children had presented any behavior or conduct problems to the school. 
Other random selections were used to replace those on which the 
teacher could not assure the examiner. In this manner a control group 
was gained that consisted of a random selection of “non-problem”
children. The three groups used in this study are quite comparable in 
chronological age and school achievement, since the grading system of 
the school is based upon the homogeneous grouping of these two 
factors. 

The association-motor technique was used in this study for measur- 
ing adjustment and maladjustment. Although this technique has 
been the subject of much discussion and criticism, its ability to differ- 
entiate adjusted from maladjusted groups has been clearly shown. 
Demonstrations of this have been made by Luria,’ Barnacle, Ebaugh, 
and Lemere,' Huston, Shakow, and Erickson,* Kephart,® and Hout- 
chens.* In a previous study it was demonstrated that a juvenile 
delinquent group tended to distribute itself in a bi-modal distribution 
over twice the range a randomly selected normal group distributed 
itself in performance on this test. Furthermore, at least fifty per cent 
of the delinquent group ranked above the scoring range indicated by 
the normal group. It was also conjectured that the lower mode group 
was characterized by low social standards. Thus, aggressive asocial 
standards were acceptable to their group and they consequently were 
psychologically well-adjusted. 

The modification of the motor-association technique, instructions,
procedure, and scoring method used were identical with those employed
in a former study.* The Kent-Rosanoff word association list was used
as the stimulus with recordings of reaction-times, verbal responses, 
voluntary responses, and involuntary responses being made simul- 
taneously on a polygraph set-up. The examiner made a particular 
effort to secure rapport with the subject before the examination started 
and care was especially taken to prevent the boys from knowing or
feeling that different groups of them were being studied. 


RESULTS 


For purposes of comparison, the distributions of former groups 
studied are given in the same graph showing results obtained in this 
study (Fig. 1). It will be noted that the problem group selected by the
teachers closely approximates the distribution of the juvenile delin- 
quent group. Also, the randomly selected normal group approximates 
the former control group. This would tend to indicate that teachers’ 
estimates of problem children are positively correlated with the overt 
conduct disorders usually found in the juvenile court as measured by 
this technique. Furthermore, the technique has a fairly high reliabil- 
ity shown by the similar distributions of two randomly selected control 
groups. 

The most striking finding, however, relates to the so-called “best
adjusted” group selected by the teachers. The distribution of these
cases appears to be very comparable to the upper division of the
juvenile delinquents’ and problem children’s distributions, a result
which is entirely out of agreement with what would be expected.

If any generalizations can be made it would seem that these teachers 
not only felt that the most serious behavior patterns were those that 
interfered with the smooth functioning of school routine because of 
aggressiveness and failed to recognize the seriousness of withdrawing, 
evasive patterns, but that they actually selected maladjusted children 
of another extreme as their “best adjusted” cases.

Although it is not safe to make any iron-clad statements regarding 
the results contained herein, it appears to be evident that an atypical 
group of some sort was selected by these teachers as being their “best
adjusted” pupils, since very little overlapping occurs with the randomly
selected normal group. Furthermore, the type of adjustment shown by 
this group may be seriously questioned, since the symptoms uncovered 
by the motor-association test are characteristic of a maladjusted 
population. It is at least in order to conclude that, since the various 
groups have differentiated themselves so clearly by the use of this 
technique, there is a strong indication that teachers’ judgments of 
behavior disorders are not in agreement with mental hygienists, and, 
conversely, that teachers’ judgments of adjustment seem to be opposed 
to mental hygienists’ judgments if the symptoms shown by the motor- 
association test may be taken at their face value. 

Fig. 1.—Graphical representation of motor-association test scores for selected groups
of junior high school boys. 

A NEW SCALE FOR RATING SCHOOL BEHAVIOR AND
ATTITUDES IN THE ELEMENTARY SCHOOL¹


DOROTHY VAN ALSTYNE 


Francis Parker School 
Chicago, Ill.


The plan of this study was to formulate a scale which would rate 
the emotional and social aspects of the personality of children from 
the nursery school through the sixth grade. 


BACKGROUND OF THE STUDY 


Rating scales are now accepted as one of the best methods of 
character diagnosis,² provided the scales are specific and the judges
thoroughly acquainted with the child. Particularly in the elementary 
school is the use of the rating scale both possible and valuable. The 
teacher who knows the interpretive background of the whole context of 
the child’s environment and sees him in all his relationships from day 
to day has one of the best means of obtaining evidence concerning the 
child’s personality. The usefulness of such evidence for getting a 
“whole picture” of a child by a person who knows him so well cannot 
be overestimated. 

Prior to beginning this study the Haggerty-Olson-Wickman Behavior
Rating Scale³ was the only teachers’ rating scale available for
making a general survey of the characteristics of an elementary-school 
child in the early grades. Since this study was started the Character 
and Personality Rating by Maller has been published by the Teachers 
College Bureau of Publications, Columbia University. This scale 
consists of fifty aspects of character and personality to be rated. Three
points are described briefly—the two extremes and the average. For 
example, 


Leadership.—1. Never leads in social activity. 
2. Occasionally acts as leader. 
3. Is a born leader; has high degree of initiative. 





1 This study was made under a grant from the Behavior Research Fund,
Chicago, Illinois. It could not have been carried out without the assistance of
LaBerta Weiss Hattwick and Helen Totten and the generous coöperation of
Rose H. Alschuler, Carleton Washburne, and the Winnetka School Faculty. 

2 Symonds, P. M.: Diagnosing Personality and Conduct. New York: The Century
Co., 1931, p. 566.

3 Published by the World Book Co., Yonkers, N. Y. 
In a follow-up study of Winnetka Nursery School children a need was
felt for a scale which would be more specific than this one, a scale which 
would have in most instances more than five points so that descriptions 
could be found which would be suitable for all types of children and 
especially a scale which would be in terms of actual classroom situa- 
tions so that the teacher could more accurately make his judgment. 

The literature on the subject of rating scales was searched for help 
in the problem. Many of the scales for rating nursery-school children, 
adolescents, and adults were suggestive in formulating plans for a new 
scale. The studies by Furfey,¹ May,² and Willoughby³ were especially
used in organizing the plan of work for the present study.


CRITERIA FOR THE CONSTRUCTION OF THE RATING SCALE 


The criteria for this scale were as follows: 


1. The scale must deal with actual classroom situations and be in terms 
meaningful to teachers.

2. There must be a number of situations to cover any specific area in the 
field of personality in which it seemed desirable to rate the children. 

3. Response levels to these situations should be in terms of specific inci- 
dents so far as possible. 

4. The response levels should be gathered from teachers of all grades 
from the nursery school through the sixth grade and should represent their 
observations on individual children. 

5. The final scale should be tried out on enough children so that adequate 
grade norms, distributions, and general criticism of the scale would result. 


METHODS USED IN THE FORMATION OF THE SCALE 


1. Teachers were asked to select the two “best-organized” and the
two “most poorly-organized” children in their classes. “Best-organized”
was interpreted to mean “those children with whom the
teacher was making no particular effort at adjustment because the
child’s development was proceeding smoothly” and, conversely,
“poorly-organized” was to refer to “those children on whom the





1 Furfey, Paul H.: “The Measurement of Developmental Age.” The Catholic
University of America, Educational Research Bulletin, Vol. II, No. 10, Dec., 1927,
pp. 40. Catholic Educational Press, Washington, D. C.

2 May, Mark A.: “Problems of Measuring Character and Personality.” J. of
Psych., Vol. III, No. 2, May, 1932, pp. 131-145.

3 Willoughby, Raymond R.: “A Scale of Emotional Maturity.” J. of Soc.
Psych., Vol. III, No. 1, Feb., 1932, pp. 3-35.
teacher was expending time and effort to bring about a balance in the
child’s personality, both socially and emotionally.” 

2. The teachers were asked to report incidents of classroom 
behavior on which they based their judgments and to keep a descrip- 
tion of further incidents over a four-week period. 

3. These incidents were then sorted out into specific classroom
situations such as “When visitors are present,” “When a child’s
activity is interfered with by another child,” “When opportunity for
free activity is given,” etc. As incidents were sorted it became
apparent that the child’s response to each of these situations could be
classified in a tentative order of desirability. It would have been
valuable, if possible, to have exactly the same specific incident indicated 
in each response level but this was not possible because of the age range 
covered. For example, in the nursery school a child’s activity might 
be interfered with by another child who was continually bumping into 
him on his bicycle but in the fifth-grade level the interference might 
consist chiefly in the disturbing questions asked by another child. 
Also, a large number of these very specific items would be necessary to 
make any kind of judgment on the child’s ability to deal with inter- 
ference. Therefore, some sort of generalization had to be made to 
cover the different sorts of incidents in the different grades as well as 
those within the same grade. For purpose of convenience the situations
were tentatively classified under the headings (1) Coöperation
with Adults—Response to Adults, (2) Emotional Security, (3) Group-mindedness-Sociability-Response
to Other Children, (4) Responsibility,
Dependability, Self-Reliance, Self-Respect, and (5) Initiative.

4. Each teacher was asked to divide her class into six groups: 
Group I, the children whom she considered the best-organized, Group 
II, next best-organized, etc. The groups did not have to have an 
equal number of children. The teacher was asked to evaluate the 
response levels thus far outlined under each situation heading in terms 
of the children in each of the six groups. The following directions 
were given: “Take each situation in order and think of the probable or
usual response to that situation by the children listed in Group II. 
If the response can be characterized by a level already noted, place a 
‘1’ in front of that level. If it is not present, formulate the level
which does characterize this group and insert in the outline where you 
judge it belongs. Do the same for the other five groups.” Elimina- 
tions and criticisms of headings were also requested. 
5. A tentative form of the scale containing thirty situations evolved
from this analysis. This was mimeographed and presented in booklet 
form to the teachers of the seminar group, with the request that they 
rate each of the children in their class in a separate booklet. Directions 
were given that for a given child, if no response level were found, a 
statement was to be written in its adjudged proper place. In this way 
about thirty children were rated on each grade level from nursery 
school through sixth grade—a total of about two hundred fifty children. 

6. These tentative forms containing the thirty situations were 
revised on the basis of the evidence gathered from the rating of the 
two hundred fifty children. The frequency of each response level¹ for
each grade as well as the medians were calculated. In the revision the 
following points were kept in mind: 


1. Insertion of new response levels which had been indicated by the 
teachers. 

2. Formation of new response levels where the frequencies indicated the 
possible need for other responses in that child. 

3. Elimination of ambiguous wording or wording which might “color”
the whole response level and so should be avoided. 

4. The judgment by the teachers in the group of the order of desirability 
of the response levels. 


7. The revised scale of thirty-three situations was then mimeo- 
graphed in booklet form. In this new form the general headings such 
as Responsibility, Coöperation, etc., were omitted and the response
levels were arranged in random order. This was done so that there
would be no bias in the teachers’ judgment concerning either the influence
of a general heading or the influence a “good to bad” order might
have in judging individual children. Also such an arrangement was 
thought to make for a finer scrutiny of each response as individual 
children were judged. 

8. Over twelve hundred children were rated by means of the revised 
scale. Approximately eleven hundred were in the Winnetka Public 
Schools from nursery school through the sixth grade; fifty were in the 
Emergency Nursery Schools in Chicago, and seventy-five were in two 
rural schools in Kansas. The ratings were made in March, 1934. 





1 The term “response level” is used throughout the article to mean that level of
response which characterizes the child’s behavior response to a given situation. 
In the final form of the test response levels are arranged in order of desirability, 
that is, from the level of behavior considered best by the teachers to the level of 
behavior considered worst by them. (See Table I.) 

STATISTICAL METHODS AND RESULTS


Decile Scores.—Frequencies for each response in the thirty-three 
situations were calculated for the eleven hundred twenty-eight 
Winnetka children. Due to poor distribution (for example, too great 
a frequency on the highest response level) five situations were omitted
from the final consideration. On the basis of the percentage of fre- 
quency for each response level, decile scores were devised. 
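
The article does not state how the percentages were converted into decile scores. One plausible reading, offered only as a sketch, gives each response level the decile in which the midpoint of its cumulative percentage falls. The function name and the sample counts are hypothetical.

```python
def decile_scores(frequencies):
    """Assign each response level a decile score from 1 to 10.

    `frequencies` lists the number of children rated at each level,
    ordered from the least to the most desirable level. A level's decile
    is the tenth of the distribution containing the midpoint of the level.
    """
    total = sum(frequencies)
    scores = []
    below = 0  # children rated at lower levels
    for count in frequencies:
        # Integer form of floor(10 * (below + count / 2) / total) + 1.
        decile = (10 * (2 * below + count)) // (2 * total) + 1
        scores.append(min(10, decile))
        below += count
    return scores


# Hypothetical counts for one situation with five response levels.
example_counts = [10, 20, 40, 20, 10]
print(decile_scores(example_counts))  # [1, 3, 6, 9, 10]
```

Under this reading a level chosen for very many children spans several deciles, which may be why situations with too great a frequency on the highest level were set aside.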

Medians.—Each child’s ratings were scored and grade medians 
from the nursery school through the sixth grade were computed. 
With the exception of the sixth grade the medians were approximately 
the same. In six of the thirteen situations finally selected, the sixth 
grade had a median one or more steps higher than the other grades. 
In one instance the nursery school is one step lower than the other 
grades. (See Table I.) It is possible that this discrepancy may be 
partly accounted for by the fewer cases at each of these grade levels. 

Multiple-factor Analysis.—The thirty-three situations had been 
tentatively classified under the headings previously mentioned. A 
more adequate means of finding the situations which belong in separate 
categories was supplied by the multiple factor method.¹ This method
finds those elements that are common in a group of tests (or rather 
responses, as in this case). 

Using the eleven hundred cases the scores on the thirty-three situa- 
tions were intercorrelated by means of the tetrachoric r for approxi- 
mately twenty-five of the correlations. These same correlations were 
then calculated for the three hundred sixty-one cases of grades I and V. 
The correlations of these three hundred sixty-one cases corresponded 
so closely to those of the eleven hundred that the remainder of the five 
hundred twenty-eight correlations were made on the basis of the smaller
sampling. The difference was never more than .10 and usually much 
less. 
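
The figure of five hundred twenty-eight is the number of distinct pairs among thirty-three situations. The sketch below checks that count and shows how the agreement between the two samples could be tested. The function names and the sample coefficients are hypothetical and are not values from the study.

```python
def pair_count(n_items):
    """Number of distinct pairs of items, n(n - 1) / 2."""
    return n_items * (n_items - 1) // 2


def max_abs_difference(full_sample, subsample):
    """Largest absolute gap between coefficients computed on two samples."""
    return max(abs(full_sample[pair] - subsample[pair]) for pair in full_sample)


print(pair_count(33))  # 528

# Hypothetical coefficients for three pairs of situations (not the study's data).
full_sample = {("I", "II"): 0.52, ("I", "III"): 0.61, ("II", "III"): 0.47}
subsample = {("I", "II"): 0.47, ("I", "III"): 0.66, ("II", "III"): 0.50}
print(max_abs_difference(full_sample, subsample) <= 0.10)  # True
```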

Thurstone’s method was used for the computation of the factors. 
Three factors were found to be present in these items. Factor 1 was 
contained to a substantial amount in the majority of the thirty-three 
situations. It is the highest common factor in the whole scale. 





1 Thurstone, L. L.: “The Vectors of Mind.” Psych. Rev., Vol. XLI, No. 1,
Jan., 1934. The writer is indebted to Dr. Thurstone for the
suggestion to use this method and to Dr. Neil J. Van Steenberg, Fellow in Psychol- 
ogy at the University of Chicago, who did the multiple factor statistical work and 
gave much helpful advice on all statistical aspects. 
According to the items which contained the largest amount (see
Table IV, Situations III, XI, XIII; also II, VII, XII), this factor may 
be the ability to impress the teacher that the individual is a worth- 
while pupil. It probably contains intelligence in the sense of adapt- 
ability to environment and the pupil’s response to social approval. 
It is the aggregate of the teacher’s judgment of a good or a poor pupil 
—a “generalized teacher’s estimate.” Possible names for this factor
might be “Pupil Desirability,” “School Behavior,” “School Adjustment,”
or “Teacher’s Average Judgment of Pupil’s Behavior.”

Situations VIII, IX and X, (see Table IV), have a fairly large 
amount of Factor 2. This factor is low in situations I and XIII of this 
scale. (Of the twenty situations finally discarded, there were a few 
others low in this factor.) Factor 2 may be said to differentiate 
between (1) the pupil who is a leader, is self-confident in a group, starts 
social tasks promptly, initiates activities in free play periods, coöperates
well on group projects, and usually directs them, and (2) the pupil who 
is likely to take orders without question, accepts group standards 
readily, and takes turns readily. This factor is probably best named 
“Leadership.”

Factor 3 was found to be contained largely in situations III, IV and 
V and to be low in situations VI, VIII, and XII. It differentiates 
between (1) the pupil who conforms to the group more because he 
wishes to do so, is willing to sacrifice to conform to standards imposed 
by society, and wishes for social approbation, and (2) the pupil who is 
independent of approval, is self-confident in a group (though probably 
for a different reason from the leader’s) and is rather independent in 
his attitude toward society. This category might be called “Social
Consciousness” or “Group Consciousness,” “Social Conformity” or
“Gregariousness,” and the opposite aspect “Self-Sufficiency” or “Social
Independence.”

Three factors so closely described the trait configuration that it is 
doubtful whether a fourth factor with its attendant high probable 
error would have contributed anything worth while.¹ At present there





1 At least this is true according to the method of factor analysis as described in 
the 1935 article. After this scale was published in the summer of 1935, Thurstone’s 
new book appeared, Vectors of Mind (published by University of Chicago Press, 
1935) in which a revised factor method was given. The matrix loadings of the 
present study were then revised according to the new method (Table V). This 
new method yielded four factors. Factors 1 and 2 corresponded very closely to 
those of the original method, and factor 4 corresponded closely to factor 3 of the 
is no objective criterion available for determining exactly how many
factors should be taken out. 

Intercorrelations.—Another method of obtaining items which have 
elements in common is by grouping a number together by “common
sense” (as had been done in the first form under headings such as
Responsibility, Emotional Security, etc.) and correlating them with 
each other and with the total—in this case the sum of the score of the 
thirty-three situations. 

When this procedure was followed in the case of the items under 
Coöperation, it was found that the situations, which are now Situations
I and II in the final scale (see Table I), correlated better with the six
situations grouped by judgment under Coöperation than with the total
scale. They were, therefore, considered to be fairly indicative of what
is meant by Coöperation. Situation III of the present scale was high
with both the total scale and this grouping, and so was made an addi- 
tional item to be included with the other two. 

In the grouping of the original thirty-three situations, ten items 
were judged to be indicative of Emotional Security. Of these ten, the 
present situations V, VI and VII proved to be the three which had the 
highest degree of this element as indicated by the fact that the correla- 
tions were higher with this grouping than with the total scale. Under 
the grouping, Responsibility, the present situations XI, XII and XIII 
proved the best of the nine items. 

Since two of the items (situations III and V) chosen by this intercorrelation
method had already been selected on the basis of the factor
analysis, there was some overlapping. Situation III is repeated in the
final scoring of each of the groupings, Coöperation and Social Consciousness,
and Situation V is repeated in the final scoring of each of the
groupings, Social Consciousness and Emotional Security.





former method with one difference. Situation V (Situation 15 on the experi- 
mental form) is not substantiated by the newer method since it changed from a 
matrix loading of —.24 to +.15. Factor 3, as described by the new method 
apparently is a factor which corresponds closely to the grouping called Responsi- 
bility. It verifies the choice of the items chosen as best under this heading, but 
also points out two other items (Situations 4 and 33) not under this heading that 
would have been as good or better. Since the method of factor analysis is being 
revised continually, the writer does not believe that the Winnetka Scale need be 
altered until further evidence presents itself. Although some of the matrix load- 
ings are not as large as one would wish, chiefly due to the high first factor, the most 
promising ones were chosen in each case. The length of the scale, the distribution 
of the ratings and the practical aspects of the situations had to be considered, as 
well as the fundamental one of statistical designation. 

The Final Scale.—On the basis of these findings a final scale of
thirteen situations was selected. The whole scale represents the first 
factor—school behavior—as previously described. The other two 
factors and the three groupings were made the basis for the division of 
the items into five headings of three situations each (situations III and
V being repeated in two groupings), as indicated in Table II.¹ The
graphic portrayal of the ratings of one child is shown by the profile in
Fig. 1.

Reliability.—Eight teachers of grades from the kindergarten 
through the fifth rerated their pupils at intervals of two to eight weeks. 
Altogether there were two hundred one children rerated—forty-five 
in the kindergarten, fifty-eight in grade 1, fifty-two in grade 2, fifteen 

Fig. 1.—A profile graph of an individual rated on
the Winnetka Scale for Rating School Behavior and
Attitudes.


in grade 3, and thirty-one in grade 5. All of these ratings were made 
toward the end of the school year, the majority of them at the end of 
April and again at the close of school in the middle of June. Since 
many of these ratings were almost two months apart and since it is 
especially true in the lower grades that a child’s attitude may change 
considerably after having mastered the technique of reading toward 
the end of the school year, the coefficients given should be thought of as 
the minimum reliability of the rating scale. The Pearson r, when 





1 The situations which represent the negative aspects of those factors could 
have been included in this scale, with the scores reversed. However, since this 
scale was to be used by teachers in thinking about their children and possibly by the 
older children themselves, it seemed more meaningful to use the scores of those 
items which could be easily interpreted. 

Also, the final thirteen situations were analyzed for their weighting in both 
positive and negative aspects of factors 2 and 3. This was done so that the scale 
would not be overloaded with either of these factors. 
Sheppard’s correction is used, is .87 for the entire scale. The corrected
r’s for the five main groupings are as follows: Coöperation, r is .82; Social
Consciousness, r is .80; Emotional Security, r is .79; Leadership, r is .72;
and Responsibility, r is .82.
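
The computation behind these coefficients is not shown in the article. As a minimal sketch, assuming the usual form of Sheppard’s correction, the quantity h²/12 (h being the width of the class interval) is subtracted from each variance while the covariance is left unchanged. The function name, the interval width of one scale step, and the sample ratings are assumptions for illustration only.

```python
import math


def pearson_r(x, y, width_x=0.0, width_y=0.0):
    """Pearson r; nonzero widths apply Sheppard's correction to the variances."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Sheppard's correction subtracts h^2 / 12 from each grouped variance;
    # the covariance is not corrected.
    var_x = sum((a - mean_x) ** 2 for a in x) / n - width_x ** 2 / 12
    var_y = sum((b - mean_y) ** 2 for b in y) / n - width_y ** 2 / 12
    if var_x <= 0 or var_y <= 0:
        raise ValueError("corrected variance is not positive")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical first and second ratings of six pupils (not the study's data).
first = [1, 2, 3, 4, 5, 6]
second = [2, 1, 4, 3, 6, 5]
uncorrected = pearson_r(first, second)
corrected = pearson_r(first, second, width_x=1.0, width_y=1.0)
print(round(uncorrected, 3), round(corrected, 3))
```

Because the correction reduces both variances, the corrected coefficient is always somewhat larger in absolute value than the uncorrected one.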

Validity.—The final form of the scale (the selected thirteen situa- 
tions) was correlated with the Haggerty-Olson-Wickman Scale which 
had been rated at the same time for fifty-three cases. The correlation 
of the final form (the Winnetka Scale for Rating School Behavior and 
Attitudes) with Schedule A, Behavior Problem Record, is +.54. The 
r with Schedule B, Behavior Rating Scale, is +.68. When the correlation is
computed with the Emotional and Social Divisions only of the Haggerty-Olson Scale,
the r is +.71, indicating a fairly substantial relationship between the
two scales. The Haggerty-Olson Scale has been validated by means of 
clinical evidence and by subsequent school history of the pupils rated. 
The validity, if correlated with a pure criterion, would probably be 
higher than this because the errors in each scale tend to lower the 
correlation. Since there is no other criterion available, this is the 
best estimate. 
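
The reasoning that errors in each scale lower the observed correlation is commonly expressed by Spearman’s correction for attenuation, in which the observed r is divided by the square root of the product of the two reliabilities. The article does not report applying it; the sketch below is illustrative only. The .68 and .87 are taken from the text, while the reliability of .85 for the criterion scale is a hypothetical value.

```python
import math


def disattenuated_r(r_xy, reliability_x, reliability_y):
    """Spearman's correction: estimated correlation between error-free scores."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


observed = 0.68             # r with Schedule B, as reported in the text
winnetka_reliability = 0.87  # rerating reliability reported in the text
criterion_reliability = 0.85  # hypothetical; not given in the article

# Correcting for error in the Winnetka Scale alone.
one_sided = disattenuated_r(observed, winnetka_reliability, 1.0)
# Correcting for error in both scales under the assumed criterion reliability.
two_sided = disattenuated_r(observed, winnetka_reliability, criterion_reliability)
print(round(one_sided, 2), round(two_sided, 2))
```

Either estimate exceeds the observed coefficient, which is consistent with the statement that the validity against a pure criterion would probably be higher.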

Scatter of Response Levels.—The response levels of the second experi- 
mental form had been arranged in a random order because it had been 
thought that a ‘‘good to bad” order might influence the teachers’ 
judgments and result in a narrow range of ratings. To see if this had 
been true, a study was made of the two types of order. The first 
experimental form had been arranged in the order of desirability of the 
response levels. There were four situations in the two forms which 
had remained exactly alike in a number of response levels, i.e., no new
response levels had been added in the revision. Six teachers had rated 
their classes during two successive years—the first year on the form in 
which the items were arranged in order of desirability, and the second 
year in random order. SD’s were calculated for all of the ratings of the
four situations in the two forms. The mean SD of the order of desir- 
ability was 1.29 and of the random order was 1.25, indicating no differ- 
ence (see Table III). On the basis of this evidence the response levels 
were arranged in the order of desirability in the final scale. 


SIGNIFICANCE OF FINDINGS 


A rating scale which can be used from nursery school through the 
sixth grade is valuable for use in early diagnosis of disposition trends. 
It is especially important in the early years when habit patterns are in 
the process of formation. 

The fact that medians are practically the same from year to year 
may indicate that the raters were thinking in terms of their own grade 
only, and that although actual growth may have occurred, each teacher has 
his own level of expectancy for the children of that grade. 

The rating scale is a conduct scale. It summarizes action ratings 
and makes possible detailed ratings based on specific observations. 
The traits rated are summations of behavior defined in relation to the 
situations in which they occur. 


VALUE AND USE OF THE SCALE 


1. The scale is in the language of the environment in which it is used and 
in terms of situations in which the teachers are interested and which they 
most frequently note. 

2. The ratings help to make the teacher’s judgment analytical. 

3. The scale is specific in its description of situations and response levels. 

4. Its best use will come from judgments made after extended observation 
(at least two months’ knowledge of the child). 

5. The scale should be used for analysis of individual differences within 
a grade rather than for evaluation of differences between grades. 

6. The profile graph furnishes a means for seeing clearly an individual’s 
assets and liabilities among the five traits of Coöperation, Social 
Consciousness, Emotional Security, Leadership, and Responsibility. 

7. The scale is valuable in obtaining cumulative records of an individual’s 
behavior, in comparison with that of his group, from the nursery school 
through the sixth grade. 

8. The scale is a reliable instrument to use in research work where a 
general estimate of the pupils’ school behavior and attitudes is desired. 


SUMMARY 


The Winnetka Scale for Rating School Behavior and Attitudes¹ has 
been constructed by teachers of the Winnetka Public School faculty 
and the writer by analysis of actual incidents occurring in the class- 
room. The data were classified into situations and response levels. 
The response levels were judged in order of desirability by a large group 
of teachers. 

One form of the rating scale (consisting of thirty situations) was 
used in the rating of two hundred fifty children. The second form 
(consisting of thirty-three situations), revised on the basis of the previ- 





1 Published by the Winnetka Educational Press, Horace Mann School, 
Winnetka, Ill. 
TABLE I.—DISTRIBUTION, DECILE SCORES AND GRADE MEDIANS OF THIRTEEN 
SITUATIONS SELECTED FOR FINAL SCALE 

N = 1128. Grade medians¹ are based on the following numbers of children: 
N.S., 24; Kg, 105; Grade 1, 174; Grade 2, 166; Grade 3, 179; Grade 4, 155; 
Grade 5, 161; Grade 6, 58. M marks a response level at which one or more 
grade medians fall. 

Percentage of N | Decile score | Response level | Median 

Situation I.—When taking turns with apparatus or materials or in a group discussion. 
10 | 10 | Waits patiently for a turn | 
41 |  9 | Takes turn willingly | M 
20 |  5 | Needs occasional reminder to be patient | M 
12 |  3 | Is too patient; does not assert himself | M 
 7 |  2 | Is impatient while waiting turn | 
 5 |  1 | Is unwilling to wait turn | 
 5 |  1 | Is unwilling to wait turn and interferes with other children’s activities | 

Situation II.—When there is a group project to be carried out. 
39 | 10 | Enjoys coöperating with others to improve the group work | 
36 |  6 | Coöperates willingly with others | M 
16 |  3 | Is slow to coöperate | 
 3 |  1 | Does not coöperate with group | 
 3 |  1 | Withdraws from group activity and carries on non-valuable activity | 
 3 |  0 | Hinders group activity | 

Situation III.—When faced with a social situation involving sacrifice of own interests or needs to 
those of group. 
13 | 10 | Puts group needs before own needs | 
26 |  9 | Helps group when own work is done | 
19 |  6 | Does own work before attending to schoolroom jobs or helping other … | M 
 9 |  4 | Gives time and thought to others to harm of own achievement | 
23 |  3 | Follows own interests | M 
 6 |  1 | Thinks only of own immediate satisfaction | 
 4 |  0 | Follows own interest to the point of being disturbing to the group | 

1 These medians represent one thousand twenty-two Winnetka children. The fifty-one Chicago 
Emergency Nursery School children and two “mixed grades,” the 3-4 grade (twenty-four children) 
and the 4-5 grade (thirty-one children) in the Winnetka Public Schools have been omitted from 
these calculations of the medians. They were used for the decile score, however, hence the N is 
one thousand one hundred twenty-eight. 
TABLE I.—(Continued) 

Percentage of N | Decile score | Response level | Median 

Situation IV.—When a child has a social task to be completed. 
 9 | 10 | Carries task to completion even by sacrifice of other interests | 
31 |  9 | Carries task through by steady effort even though it does not harmonize with special interests | M 
32 |  6 | Carries task through only when it does harmonize with special interests | M 
15 |  3 | Carries task through although applica… | 
 8 |  1 | Drops task; loses interest quickly | 
 5 |  1 | Tries to escape task by contrary behavior or by shifting jobs | 

Situation V.—Emotional Tone in School. 
 8 | 10 | Is happy and not easily downed; enjoys work as much as play | 
31 |  9 | Shows even, cheerful disposition; is … | M 
33 |  6 | Does not show an unusual amount of … | M 
 7 |  3 | Is over-serious and conscientious | 
 9 |  2 | Does not take things seriously enough | 
 9 |  1 | Shows extreme amount of changeable… | 
 3 |  0 | Is sullen or irritable | 

Situation VI.—When there is a chance to go to adults for help or approval. 
30 | 10 | Shows satisfaction in own ability without being dependent on adult … | 
42 |  7 | Shows satisfaction in own ability but needs some adult approval | M 
13 |  3 | Does not seem to get satisfaction in his own ability or to recognize it without adult approval | 
 8 |  2 | Bids for approval; for example, shows work to adult for praise | 
 4 |  1 | Acts only when adult gives approval … | 
 3 |  0 | Bids for help (whines, cries, complains, stalls, etc.), until he realizes help is not forthcoming | 

Situation VII.—When faced with failure. 
12 | 10 | Sees cause of failure and corrects it | 
22 |  9 | Tries to get help to overcome difficulty | M 
 8 |  7 | Recovers quickly and plans new … | M 
TABLE I.—(Continued) 

Percentage of N | Decile score | Response level | Median 

Situation VII.—When faced with failure. (Continued) 
25 |  6 | Shows disappointment but continues … | M 
 4 |  3 | Is apparently indifferent to failure | 
21 |  2 | Becomes discouraged easily; must succeed in order to continue activity | 
 3 |  0 | Becomes irritable or angry, or cries | 

Situation VIII.—When in an organized group with teacher present. 
19 | 10 | Is able to lead a group without being nervous or embarrassed | 
 8 |  8 | Leads group in spite of being nervous or embarrassed | 
17 |  … | … | 
31 |  6 | Does not lead group but is confident in dealing with individuals | M 
 6 |  3 | Tends to be shy with adults but not … | 
 1 |  2 | Tends to be shy with children but not … | 
18 |  2 | Is shy with both children and adults | 

Situation IX.—When child has opportunity to take responsibility for a group task. 
15 | 10 | Directs task and carries it to completion for group benefit | 
15 |  9 | Takes responsibility for a task without … | 
10 |  7 | Takes task but does not complete it | M 
20 |  6 | Takes responsibility for task only when especially asked by teacher | M 
 7 |  4 | Takes responsibility for a task only when special interest is involved | 
14 |  3 | Rarely wants to take charge of task | 
19 |  2 | Cannot take responsibility for a group … | 

Situation X.—When in a social situation which allows for initiative. 
10 | 10 | Can organize and lead large group | 
28 |  9 | Can organize and lead small group | 
14 |  6 | Can lead another child | M 
17 |  5 | Takes good care of self but does not attempt to lead others | M 
 1 |  3 | Does not like to have others take the lead and clings to own ideas | 
 6 |  3 | Bothers other children or bosses them | 
TABLE I.—(Continued) 

Percentage of N | Decile score | Response level | Median 

Situation X.—When in a social situation which allows for initiative. (Continued) 
 2 |  2 | Allows other child to boss him in a way that is harmful to himself or … | 
 9 |  2 | Shows cruel tendencies, such as bullying (bossing weaker child), ridiculing, … | 
 2 |  … | … | 
11 |  1 | Shows no social initiative | 

Situation XI.—When he has finished studying a subject. 
11 | 10 | Has time so planned that he knows what work to do next | 
27 |  9 | Starts new work without reminder | M 
25 |  6 | Starts new work but needs help of teacher in planning | M 
 3 |  4 | Starts new work but gets other children … | 
11 |  3 | Begins something other than what he … | 
 9 |  2 | Wanders around aimlessly or sits in seat day-dreaming | 
14 |  1 | Wanders around room annoying other children or sits in seat bothering … | 

Situation XII.—When he can get help from adult. 
17 | 10 | Tries hard by himself before he will ask for help or makes own plans; does not need help | 
38 |  8 | Asks only for necessary help | M 
15 |  5 | Neglects to ask for help when he really … | M 
20 |  3 | Depends upon help being given | 
 5 |  1 | Asks unnecessarily for help | 
 5 |  1 | Helps self only when urged | 

Situation XIII.—When things must be organized for work. 
36 | 10 | Gets things he needs together ahead of time so that work goes smoothly | 
21 |  6 | Careful but slow in getting things … | M 
17 |  4 | Careless in getting things together | M 
19 |  3 | Only gets things as needed | 
 7 |  1 | Waits for others to get things he needs … | 
TABLE II.—GROUP OF THE THIRTEEN SITUATIONS UNDER THE FIVE HEADINGS 


COÖPERATION                          LEADERSHIP 
  Situation …                          Situation VIII 
  Situation II                         Situation IX 
  Situation …                          Situation X 
SOCIAL CONSCIOUSNESS                 RESPONSIBILITY 
  Situation III                        Situation XI 
  Situation IV                         Situation XII 
  Situation V                          Situation XIII 
EMOTIONAL SECURITY 
  Situation V 
  Situation VI 
  Situation VII 


TABLE III.—COMPARISON OF SCATTER OF RATINGS 
Mean SD’s of Four Situations Having Same Number of Items in Each Form, 
Mean of Six Raters 

            Form I, items arranged in         Form II, items arranged in 
            order of desirability             random order 
Situation   Mean rating   Mean SD     N       Mean rating   Mean SD     N 
A               4.7         .99      154          5.0         .87      145 
B               3.5        1.14      156          3.3        1.12      145 
C               4.2        1.47      156          4.3        1.32      145 
D               5.0        1.55      136          5.0        1.99      138 

Mean            4.4        1.29                   4.4        1.25 
Total N                              602                               573 























ous trial with the first form, was used in rating over twelve hundred 
children. Decile scores were calculated from the eleven hundred 
twenty-eight ratings made on Winnetka children. Medians for each 
grade were found to be practically the same, indicating an expectancy 
level for each grade. 

The multiple factor method was applied to the intercorrelations of 
the thirty-three situations. By means of this method and 
intercorrelations of certain groupings, five categories resulted. The five 
categories have been named Coöperation, Social Consciousness, Emotional 
Security, Leadership, and Responsibility. A minimum reliability was 
determined by correlation of two ratings made by the same teachers 
at intervals of two to eight weeks. The minimum reliability r for the 
complete scale is +.87. The r’s for the five main groupings range 
from +.72 to +.82. Validity was obtained by correlation with the 
Social and Emotional Divisions of the Haggerty-Olson-Wickman 
Behavior Rating Scale. The validity r is +.71. 

The final scale consists of thirteen situations and their response 
levels. It is usable as a scale for rating school behavior and attitudes 


TABLE IV.—MATRIX OF FACTOR LOADINGS¹ 








Number in Number| First | Second| Third 

experi- in final | factor | factor | factor 
t mental form scale | loading} loading | loading 

oo. Coéperation with adult 

: 1 el Ls ba een eeeteswee sone +.48 | +.06 | —.04 
2 Independence of work..................-00055 +.62 | +.07 | —.05 

3 Compliance with teacher's requests............. +.53 | +.18 | —.06 

4 Helping teacher with group.................... +.63 | +.01] +.22 

5 Actions when visitors present.................. +.44/ +.14 | —.08 

Emotional! security 

6 Independence of adult approval................ VI +.62 | —.08 | +.29 
7 Application to academic task.................. spat +.64 | —.01 | —.81! 

8 Application to social task...................... IV +.60 | —.09 | —.22 
9 Ro vcccdcenn es condenseunbse +.44) +.18 | —.263 

10 Réaction to interference....................2-. ican +.51] —.06 | +.16 

11 ak lh we eek wah dee Ble aie VII +.64 | —.03 | +.01 
12 Degree of sociability....................00e06. +.36 | +.13 | —.33! 

13 Degree of adaptability........................ wad +.66 | +.05 | —.19 

14 Degree of self-confidence in group.............. VIII +.35 | —.39 | +.36 

15 RRR Eh a ee ee V +.62 | +.12 | —.24 

Codperation 

16 Acceptance of group standards................ see +.49 | +.67 | +.09 

17 EET EE Pe x +.62 | —.46 | +.15 

18 Coéperation in group project.................. II +.67 | —.07 | —.04 

19 kee ice han sso eee ange Wa We I +.51] +.30 | +.04 

20 Helping another individual.................... Ow +.59 | —.23 | —.19 

21 ee a oa eee ae III +.73 | —.10 | —.31 

Responsibility 

22 Organization of materials for work.............. XIII +.72 | +.24 1] +.06 

23 EE EE jaan +.63 | —.14] —.06 

24 Independence of adults........................] XII +.64 | +.06 | +.14 

25 EE EO DP EE eS +.41 | +.38 | +.04 

26 Promptness of starting academic task........... +.76?} +.03 | —.07 

27 Promptness of starting social task.............. ape +.51 | —.407] —.02 

28 Going from one academic task to another....... XI +.74)] +.34] —.05 

29 Going from one social task to another.......... saa +.72 |} —.04} —.01 

30 Direction of group tasks...................00. IX +.60| —.25 | +.05 

Individual initiative 
ge 31 Initiative in free activity period................ +.55 | —.34 |] —.02 
ri 32 | EE eer ny +.68 | —.32] —.12 
i ¥. 33 Work without supervision..................... +.71 | +.24 | +.17 
ae 




















1 As calculated by the method described in article by Thurstone in Psych. Rev., Vol. XLI, 
Jan., 1934, pp. 1-32. 


2 Omitted because of poor distribution of ratings. 
3 Abnormally high because this item was used as the pivot test. 


from nursery school through the sixth grade. It is a conduct scale and 
is best used after an extended period of observation. 











Scale for Rating School Behavior and Attitudes 


TABLE V.—MATRIX OF FACTOR LOADING¹ 

Number in      Number in    First     Second    Third     Fourth 
experimental   final        factor    factor    factor    factor 
form           scale        loading   loading   loading   loading 
 1             ..           +.48      +.10      -.11      -.06 
 2             ..           +.63      +.08      -.07      -.12 
 3             ..           +.53      +.13      +.05      -.16 
 4             ..           +.64      -.20      +.29      +.13 
 5             ..           +.45      +.20      +.08      -.29 
 6             VI           +.62      +.12      -.23      -.20 
 7             ..           +.64      +.12      -.09      +.57 
 8             IV           +.61      +.11      -.19      +.19 
 9             ..           +.45      +.33      -.07      -.21 
10             ..           +.52      +.02      -.09      -.11 
11             VII          +.64      -.12      +.10      -.22 
12             ..           +.37      +.27      -.16      -.21 
13             ..           +.66      +.09      +.01      -.16 
14             VIII         +.36      -.35      -.27      -.25 
15             V            +.63      +.21      -.13      -.15 
16             ..           +.47      +.49      +.36      +.05 
17             X            +.63      -.37      -.28      -.05 
18             II           +.67      +.10      -.32      +.06 
19             I            +.51      +.36      +.06      -.14 
20             ..           +.59      -.30       .00      -.02 
21             III          +.73      -.09      -.02      +.15 
22             XIII         +.72      +.10      +.19      +.17 
23             ..           +.63      +.03      -.24      +.11 
24             XII          +.65       .00      +.17      +.09 
25             ..           +.41      +.27      +.33      +.23 
26             ..           +.76      -.13      +.35      -.12 
27             ..           +.51      -.41      -.21      +.15 
28             XI           +.74      -.10      +.37      +.03 
29             ..           +.73      -.19      +.18      +.10 
30             IX           +.60      -.30      -.10      +.06 
31             ..           +.56      -.31      -.10      +.08 
32             ..           +.68      -.36      -.07      +.07 
33             ..           +.71      +.12      +.24      +.17 




















1 As calculated by the method described in article by Thurstone in Psych. Rev., 
Vol. XLI, Jan., 1934, pp. 1-32. 
A MATHEMATICS VOCABULARY TEST AND SOME 
RESULTS OF AN EXAMINATION OF UNIVERSITY 
FRESHMEN 


A. S. EDWARDS 
University of Georgia 


In the course of a study of the causes of failure of students in the 
University of Georgia, several lines of evidence led to the conclusion 
that one of the most important defects of these students is the lack of 
knowledge of English words and their meanings. The examinations, 
psychological and English, which gave this indication, were for the 
most part non-mathematical. It occurred to the writer that little 
is known about the students’ knowledge of mathematical words, 
signs, etc., and that a test should be devised to give some accurate 
indication of the mathematical vocabulary of students entering the 
university. 

In selecting the items for the test the present writer listed about 
one hundred seventy-five symbols, technical terms, etc., representing 
arithmetic, algebra, and geometry, and with the assistance of Dean 
R. P. Stephens, head of the department of mathematics, eliminated 
enough to leave one hundred items.¹ 

For various reasons, to expect students to be able to word defini- 
tions on this examination was considered too difficult; the matching 
test eliminates this form of answer and it was considered most useful 
for our purpose. The identification was made by means of numbers. 
Parts of the examination are shown below. 


SAMPLES OF TEST 


Directions.—In the space provided at the right of each symbol, write the 
number of the word or phrase which identifies that symbol. 











1. angle                    …  ____ 
2. circle                   +  ____ 
3. congruent                …  ____ 
4. cube root                …  ____ 
5. divided by               …  ____ 
6. equal to                 >  ____ 
7. greater than             …  ____ 





1 The writer also wishes to acknowledge the help of Mr. C. M. Cox, and Miss 
Olive Eagen, of this University. 
8. identical with           …  ____ 
9. infinity                 …  ____ 
10. less than               …  ____ 





Directions.—In the parentheses after each definition place the number of 
the term which is defined. 


1. binomial              6. factor               11. proper fraction 
2. common factor         7. improper fraction    12. quadrilateral 
3. decimal fraction      8. integer              13. rational number 
4. equation              9. monomial             14. transposition 
5. exponent             10. polynomial           15. variable 

A figure formed from four straight lines........................... ( ) 
A statement which expresses the equality of two algebraic expressions ( ) 
A fraction in which the numerator exceeds the denominator........ ( ) 
A quantity which can be divided into each term of a series of terms.. ( ) 
An algebraic expression containing two or more terms.............. ( ) 
A real number which can be expressed as the quotient of two integers. ( ) 
An expression consisting of a single term......................... ( ) 


In the spring of 1935 the test was given to 166 freshmen who had 
taken one or more courses in mathematics in the university. The 
results of this examination are given in Table I. 


TABLE I.—RESULTS OF MATHEMATICS VOCABULARY TEST GIVEN TO 166 FRESHMEN 
WHO HAD TAKEN ONE OR MORE COURSES IN MATHEMATICS DURING 
THE FIRST YEAR IN THE UNIVERSITY 

              Men, 108 cases     Women, 58 cases 
Q1                  61                  63 
Median              71                  79 
Q3                  81                  86 
Extremes          12-100              34-92 











In the fall of 1935 the test was given to six hundred sixty-five 
freshmen at the opening of the University and before they had started 
any class work in the University. The distribution of scores, medians, 
Q1, Q3, and the extremes are given for these students as a whole and 
for the students divided into eleven-year men, eleven-year women, 
twelve-year men and twelve-year women. This refers to the amount 
of training received before entering the University. See Table II. 

The students who had had some mathematics in the university 
and whose results of the vocabulary test are shown in Table I, did 
considerably better than the entering freshmen. Comparing the 
former with the latter group by means of the percentile ranks for the 







696 The Journal of Educational Psychology 


freshmen, it is found that the median for the men, namely, seventy-one, 
has a percentile rank of seventy-four; the median of seventy-nine for 
the women, has a percentile rank of eighty-seven. If it can be assumed 
that the vocabulary of the students represented in Table I was about 
the same as that of the freshmen represented in Table II, and this is 
not a very safe assumption, then it can be said that the mathematics 
vocabulary has evidently increased very considerably because of 
university work in mathematics. 
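
The comparison above rests on locating a score within the distribution of the entering freshmen. The article does not state which definition of percentile rank was used, so the following is only a sketch under one common convention (scores below, plus half of those tied). The reference scores shown are hypothetical and are not the 665 freshman scores.

```python
def percentile_rank(score: float, reference_scores: list[float]) -> float:
    """Per cent of reference scores below `score`, counting ties as one half."""
    if not reference_scores:
        raise ValueError("reference_scores must not be empty")
    below = sum(s < score for s in reference_scores)
    tied = sum(s == score for s in reference_scores)
    return 100.0 * (below + 0.5 * tied) / len(reference_scores)


# HYPOTHETICAL reference distribution, for illustration only.
hypothetical_freshmen = [50, 60, 71, 80]
rank_of_71 = percentile_rank(71, hypothetical_freshmen)
```

Applied to the actual freshman distribution, a function of this kind would give the rank of seventy-four reported for the men's median of seventy-one, provided the same convention was used.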

Whether the vocabulary in mathematics of the freshmen 
reported in Table II is adequate or inadequate can hardly be 
said. Whether it would compare favorably or unfavorably 
with that of freshmen in other institutions can only be guessed. 


TABLE II.—DISTRIBUTION MATHEMATICS VOCABULARY TEST 

Group                  Cases   Median    …       Q1      Q3     Extremes 
All students            665     59       50      45      72      6-100 
Eleven-year men         348     57       47      42      73     13-100 
Eleven-year women       162     58       48.5    42.5    69     10-92 
Twelve-year men          93     68       67      55.5    76.5    1-98 
Twelve-year women        62     59.5     52      41.5    74.5   17-92 

















The relations between the mathematics vocabulary test and the 
results of other examinations are shown in Table III. This table 
gives correlations between the mathematics vocabulary test and the 
following tests: psychological examination (A.C.E.), the mathematics 
survey test, and the English placement test. The PE’s are all low 
and the correlation coefficients are between .47 and .84. These 
definitely indicate a relationship. Correlations with the science 
test give somewhat different results: .23 to .63. 

Reliability was estimated by correlating scores on the odd-numbered and 
even-numbered items of the examination. One hundred sixty 
cases were selected by taking every fourth case. The correlation 
between odd-even scores on the examination is .92, PE .008. 
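
The odd-even procedure may be sketched as follows. The article does not give its formula for the probable error of r; the sketch assumes the classical expression PE = .6745(1 - r²)/√N, which agrees with the reported figure of .008 for r = .92 and N = 160. The function names are illustrative only.

```python
import math


def odd_even_scores(item_scores: list[int]) -> tuple[int, int]:
    """Totals on odd- and even-numbered items (1 = right, 0 = wrong)."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def probable_error_of_r(r: float, n: int) -> float:
    """Classical probable error of r (ASSUMED formula, not stated in the article)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


pe = probable_error_of_r(0.92, 160)  # rounds to .008, as reported
```

In use, `odd_even_scores` would be applied to each of the one hundred sixty papers and `pearson_r` to the two resulting lists of half-test totals.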

The correlation between the Mathematics Vocabulary Test and the 
final examination in mathematics for the fall quarter was determined 
for two groups of students. For one group, numbering one hundred 
forty-one cases, the correlation was .638, PE .033. For the second 
group, numbering one hundred eighteen students, the correlation was 
TABLE III.—CORRELATIONS BETWEEN MATHEMATICS VOCABULARY TEST AND 
FOUR OTHER TESTS, NAMELY, MATHEMATICS SURVEY, A.C.E. 
PSYCHOLOGICAL EXAMINATION, ENGLISH, AND GENERAL SCIENCE 

                                      Eleven-year                 Twelve-year 
                                  Men,         Women,         Men,         Women, 
                                348 cases     162 cases      93 cases      62 cases 
                                 r     PE      r     PE      r     PE      r     PE 
Mathematical vocabulary and 
  mathematics survey           .743  .017    .71   .026    .54   .05     .68   .05 
Psychological examination      .75   .01     .58   .035    .55   .05     .85   .02 
English                        .615  .04     .476  .04     .49   .07     .725  .04 
General science                .49   .04     .59   .04     .23   .07     .63   .054 

















.59, PE .04. These results show a considerable influence of vocabu- 
lary upon achievement in mathematics. 


CONCLUSIONS 


Results of a first attempt at making a Mathematics Vocabulary 
Test are reported. It is apparent that many entering freshmen do 
not have a sufficient mathematics vocabulary to understand all that 
might be expected of students doing work in courses in mathematics. 
The vocabulary evidently increases considerably as students take 
courses in mathematics in the University. It is doubtful if some of 
the students are able to read the mathematics texts with proper 
understanding because of lack of knowledge of the terms used. Corre- 
lations give weight to the hypothesis that a considerable amount of 
difficulty in mathematics is due to lack of sufficient knowledge of the 
mathematical terms used. 
THE EFFECT OF DO-NOT-GUESS DIRECTIONS UPON 
THE VALIDITY OF TRUE-FALSE OR MULTIPLE- 
CHOICE TESTS 


DAVID F. VOTAW 
Southwest Texas Teachers College 


Is the validity of a true-false or multiple-choice test improved by 
extending to a student the liberty of refusing to respond to an item 
when he believes his answer would be a guess? The makers of many 
published tests evidently believe validity is thus improved, for many 
of these tests direct students at their discretion to omit items. Studies 
available to Ruch¹ in 1929, most of which employed correlation 
techniques, led him to make the following summarizing statement on 
chance and guessing in tests: “The available evidence suggests that 
both more valid and reliable scores are to be obtained by instructing 
pupils to omit items where the answering is nothing more than a sheer 
guess.” 

Now, it may be stated definitely that if all students working under 
do-not-guess instructions actually rejected all those items and only 
those items to which their answers would be sheer guessing, the 
formula $S = R - \frac{W}{n - 1}$ ($S$ = score, $R$ = number right, $W$ = number 
wrong, and $n$ = number of responses in each item) would yield the 
same score that would be obtained in the event that all items were 
attempted. No advantage would appear, therefore, in validity or 
reliability from the do-not-guess directions. 
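
The scoring formula and the argument built on it can be illustrated with a short sketch. The student and the counts of right, wrong, and omitted items are hypothetical; the equality of the two scores holds when the guessed items are answered right at exactly the chance rate, that is, as an expectation.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Score corrected for guessing: S = R - W / (n - 1)."""
    if n_choices < 2:
        raise ValueError("an item needs at least two possible responses")
    return right - wrong / (n_choices - 1)


# HYPOTHETICAL student on the sixty-item true-false section (n = 2):
# 40 right, 10 wrong, 10 omitted as sheer guesses.
omit_score = corrected_score(40, 10, 2)

# If the 10 omitted items are then guessed at the chance rate,
# 5 are added to the rights and 5 to the wrongs.
guess_score = corrected_score(40 + 5, 10 + 5, 2)
```

Both scores equal 30.0. Any departure from this equality in actual papers therefore reflects either guessing better or worse than chance, or the rejection of items that were not sheer guesses.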

In actual operation, however, do students follow uniformly the 
directions? Is it possible that different temperaments may react 
differently to the directions? Is it possible that varying degrees of 
scholarship may influence differently the decision to reject items? 

In an effort to answer these questions, one hundred and twenty- 
nine students (seventy-one men and fifty-eight women) in a course 
designated as Principles of Secondary Education, at Southwest Texas 
Teachers College, were given a test consisting of sixty true-false items 
and fifty multiple-choice items of four responses. The items had 
previously been validated by J. Erle Grinnell² and the writer. The 
reliability coefficients of the two sections were .79 and .80 respectively. 





1 Ruch, G. M.: The Objective or New-Type Examination. Chicago: Scott, 
Foresman and Co., 1929, p. 356. 
2 Stout Institute, Wisconsin. 


The tests were administered by the writer as follows: 


1. The students were directed to omit items if the answers would repre- 
sent sheer guessing. Sufficient time was allowed for all items to be attempted. 

2. Papers were collected. 

3. Students were supplied with red pencils. 

4. Immediately papers were returned and students were directed to answer 
in red all items which they had previously omitted. 

5. A few days later, as regular work of the course and without any state- 
ment of connection with the first test, the students were given a scale for 
measuring ascendance-submission in personality.¹ 


For the subject-matter test two scores for each student were 
determined: a score for the do-not-guess situation and a score for the 
guessed items. The formula $S = R - \frac{W}{n - 1}$ was used for both 
scorings. Each student’s score on the guessed items represents the 
gain (or loss) which accrued to him from having been denied the 
privilege of rejecting items. Comparisons of gains were then made on 
two bases; viz., two widely spaced groups with respect to 
ascendance-submission traits and two widely spaced groups with respect to 
scholarship (final marks in the course). 
Ascendance-submission categories are: 


1. Ascendant—the twenty-seven per cent nearest the ascendant end of 
the scale. 

2. Submissive—the twenty-seven per cent nearest the submissive end 
of the scale. 

3. The middle forty-six per cent. 


The scholastic categories are: 


1. Upper twenty-seven per cent. 
2. Lower twenty-seven per cent. 
3. The middle forty-six per cent. 
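
The division into upper and lower twenty-seven per cent groups may be sketched as follows. With one hundred twenty-nine students, twenty-seven per cent rounds to thirty-five cases, which agrees with the N of 35 reported for each group. The scores used here are hypothetical placeholders, and the handling of tied scores at the cutting points is not described in the article.

```python
def extreme_groups(scores: list[float], fraction: float = 0.27):
    """Return (lower, upper) groups, each holding `fraction` of the cases."""
    k = round(len(scores) * fraction)
    ordered = sorted(scores)
    return ordered[:k], ordered[len(ordered) - k:]


# HYPOTHETICAL scores for 129 students, for illustration only.
scores = list(range(1, 130))
lower, upper = extreme_groups(scores)
middle_count = len(scores) - len(lower) - len(upper)
```

The same function serves for either basis of classification, the ascendance-submission scale or the final marks in the course.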


Independence of these two variables is suggested by a contingency 
coefficient of .10 + .06. 

In the true-false test the difference between the ascendant and 
submissive groups is not significant, being less than three probable 
errors. Some importance should nevertheless be attached to the fact that 
the difference is in the same direction as it is in the multiple-choice test. 





1 Allport, G. W. and Allport, F. H.: A-S Reaction Study. Boston: Houghton 
Mifflin Co., 1928. (A separate eight-page test booklet for each of the two sexes.) 
TABLE I.—COMPARISON OF SCORES MADE ON GUESSED ITEMS BY GROUPS 
CLASSIFIED AS ASCENDANT OR SUBMISSIVE IN PERSONALITY AND AS HIGH 
OR LOW IN SCHOLARSHIP 
(N = 35 for Each of the Eight Groups) 

Type of test    | Classification of students                  | Mean¹ ± PE²   | Dif  | PE_D | Dif/PE_D 
True-false      | Ascendant twenty-seven per cent             | +1.17 ± .33   |      |      | 
                | Submissive twenty-seven per cent            | +1.78 ± .41   |  .61 | .53  | 1.15 
Multiple-choice | Ascendant twenty-seven per cent             | +1.65 ± .19   |      |      | 
                | Submissive twenty-seven per cent            | +3.22† ± .37  | 1.57 | .42  | 3.74 
True-false      | Upper scholarship twenty-seven per cent     | +2.87† ± .31  | 3.74 | .47  | 7.96 
                | Lower scholarship twenty-seven per cent     | -0.87 ± .35   |      |      | 
Multiple-choice | Upper scholarship twenty-seven per cent     | +2.65† ± .32  | 1.43 | .38  | 3.76 
                | Lower scholarship twenty-seven per cent     | +1.22 ± .20   |      |      | 

1 In this table and in subsequent tables the dagger (†) is used to mark means or proportions 
which are significantly higher than their companion figures. 

2 Computation of the probable errors of the means was by use of the formula for samples from a 
restricted parent population, in which the usual probable error is multiplied by $\sqrt{1 - p}$ 
($p$ being the sample’s proportion of the parent population, 35/129 in this instance). 

The submissive personalities appear to profit more than ascendant 
personalities by being required to answer all items. 
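
The last three columns of Table I can be rechecked from the tabled means and probable errors. The sketch below is offered only as an illustration: it assumes that the probable error of a difference was taken as the root of the sum of the squared probable errors of the two means (the usual rule for independent groups, not stated in the text), and that Dif and PE_D were rounded to two places before the ratio was formed. The function names are hypothetical.

```python
import math


def pe_of_difference(pe_a: float, pe_b: float) -> float:
    """Probable error of a difference between two independent means.

    Assumed rule: square root of the sum of the squared probable errors.
    """
    return math.sqrt(pe_a ** 2 + pe_b ** 2)


def critical_ratio(mean_a: float, pe_a: float, mean_b: float, pe_b: float):
    """Return (Dif, PE_D, Dif/PE_D), rounding each figure to two places."""
    dif = round(abs(mean_a - mean_b), 2)
    pe_d = round(pe_of_difference(pe_a, pe_b), 2)
    return dif, pe_d, round(dif / pe_d, 2)


# (mean, PE) pairs copied from Table I.
TABLE_I = {
    "true-false: ascendant vs. submissive": ((1.17, 0.33), (1.78, 0.41)),
    "multiple-choice: ascendant vs. submissive": ((1.65, 0.19), (3.22, 0.37)),
    "true-false: upper vs. lower scholarship": ((2.87, 0.31), (-0.87, 0.35)),
    "multiple-choice: upper vs. lower scholarship": ((2.65, 0.32), (1.22, 0.20)),
}

for label, ((m1, pe1), (m2, pe2)) in TABLE_I.items():
    print(label, critical_ratio(m1, pe1, m2, pe2))
```

Under these assumptions the four tabled ratios are reproduced; only the ratios exceeding three probable errors correspond to the means marked with a dagger.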

The fact that good students profit more than poor students by the 
requirement that all items be attempted is not proof of greater relia- 
bility for the directions to answer all items. Constancy of individual 
reactions to number of items rejected and to proportion which would 
have been answered correctly would preserve reliability but not 
validity. 

Although the questions raised at the beginning of this paper have 
been answered largely by Table I, it should be of interest to investigate 
the conditions which produce these differences. Do submissive 
students and high scholarship students make their gains by answering 
correctly a greater proportion of their guessed items or by leaving more 
items to be guessed? Table II attempts to answer this question. 
The ascendant-submissive part of the table furnishes no definite proof 
that the submissive group answers a greater proportion of guessed 
items correctly, although the differences in both types of tests are in 
the same direction, slightly favoring the submissive to guess right. 
The lower part of Table II leaves practically no doubt that in the true- 
false test the upper group will do better than the lower in answering 
correctly items at which they admittedly guess. Evidence of this 
trend is much weaker in the case of multiple-choice questions although 
it should be noticed that both of the differences favor the upper group 
to guess right. 
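
The comparison of proportions wrong can be sketched in the same way. The example assumes that the probable error of a proportion is 0.6745 times the binomial standard error, and that the probable error of the difference is the root of the sum of squares; neither rule is stated in the text, although both agree with the tabled figures to within rounding. The counts are those reported for the true-false scholarship groups, and the function names are hypothetical.

```python
import math

PE_FACTOR = 0.6745  # a probable error is 0.6745 times the standard error


def pe_of_proportion(wrong: int, guessed: int) -> float:
    """Probable error of a proportion, assuming binomial sampling."""
    p = wrong / guessed
    return PE_FACTOR * math.sqrt(p * (1.0 - p) / guessed)


def compare_proportions(wrong_a: int, n_a: int, wrong_b: int, n_b: int):
    """Return (Dif_p, PE_D, Dif/PE_D) for two independent groups."""
    dif = abs(wrong_a / n_a - wrong_b / n_b)
    pe_d = math.sqrt(
        pe_of_proportion(wrong_a, n_a) ** 2 + pe_of_proportion(wrong_b, n_b) ** 2
    )
    return dif, pe_d, dif / pe_d


# True-false test: upper scholarship 140 wrong of 378 guessed,
# lower scholarship 221 wrong of 411 guessed.
dif, pe_d, ratio = compare_proportions(140, 378, 221, 411)
print(f"Dif_p = {dif:.3f}, PE_D = {pe_d:.3f}, Dif/PE_D = {ratio:.2f}")
```

Computed without intermediate rounding, the ratio comes out slightly above the tabled value, which appears to have been formed from the rounded difference and probable error.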


TABLE II.—COMPARISON OF PROPORTIONS OF WRONG ANSWERS MADE ON GUESSED 
ITEMS BY GROUPS CLASSIFIED AS ASCENDANT OR SUBMISSIVE IN PERSONALITY 
AND AS HIGH OR LOW IN SCHOLARSHIP 

                 |                                          | Number of     | Number | Proportion |      |       |      |
Type of test     | Classification of students               | guessed items | wrong  | wrong      | PE_p | Dif_p | PE_D | Dif/PE_D
True-false       | Ascendant twenty-seven per cent          | 365           | 162    | .444       | .017 | .006  | .023 |  .26
                 | Submissive twenty-seven per cent         | 462           | 202    | .438       | .016 |       |      |
Multiple-choice  | Ascendant twenty-seven per cent          | 449           | 294    | .656       | .015 | .054  | .021 | 2.57
                 | Submissive twenty-seven per cent         | 561           | 338    | .602       | .014 |       |      |
True-false       | Upper scholarship twenty-seven per cent  | 378           | 140    | .370       | .017 |       |      |
                 | Lower scholarship twenty-seven per cent  | 411           | 221    | .537†      | .017 | .167  | .024 | 6.95
Multiple-choice  | Upper scholarship twenty-seven per cent  | 558           | 350    | .627       | .014 |       |      |
                 | Lower scholarship twenty-seven per cent  | 449           | 296    | .659       | .015 | .032  | .021 | 1.52





























Since Table II has not provided an explanation for the greater gains 
of the submissive students and has provided only a partial explanation 
for greater gains of upper scholarship students, Table III is presented 
to supplement Table II. Tables II and III viewed together reveal that 
submissive students gain their advantage not so much from answering 
correctly a greater proportion of guessed items when working under 
directions to answer all items, but more from having been prevented 
by such directions from leaving a large number of items unanswered. 
The fact that the submissive (shrinking violet) type of student will 
“pass up” many more items than will his ascendant (bold) brother 
clearly places him at a disadvantage when do-not-guess instructions 
are given, even though both may be able to guess the same proportion 
of correct answers provided that proportion exceeds chance expectation. 
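
The test applied in Table III, in which each group's share of the guessed items is set against an even split, can be illustrated as follows. The sketch assumes that the probable error attached to the expected proportion of .500 is 0.6745 times the binomial standard error for the combined number of guessed items; this reproduces the tabled .012 and .011 but is not stated in the text. The function name is hypothetical.

```python
import math

PE_FACTOR = 0.6745  # a probable error is 0.6745 times the standard error


def share_vs_even_split(group_items: int, total_items: int, expected: float = 0.5):
    """Compare one group's share of guessed items with the expected share.

    Returns (actual proportion, PE of the expected proportion, Dif/PE).
    """
    actual = group_items / total_items
    pe_expected = PE_FACTOR * math.sqrt(expected * (1.0 - expected) / total_items)
    return actual, pe_expected, abs(actual - expected) / pe_expected


# True-false test: submissive students guessed at 462 of the 827 items
# guessed at by the ascendant and submissive groups together.
actual, pe, ratio = share_vs_even_split(462, 827)
print(f"actual = {actual:.3f}, PE = {pe:.3f}, Dif/PE = {ratio:.2f}")
```

A ratio well above three probable errors is what marks the submissive share as significantly larger than an even division of the guessed items.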

The upper scholastic student shows greater gains on guessed true- 
false items than the lower student because, even though he may guess 
at slightly fewer items, he will get a far greater proportion right. 
In the multiple-choice test, however, the upper student not only leaves 
more items to be guessed but he later succeeds in guessing a greater 
proportion of them right. 











TABLE III.—DISTRIBUTION OF GUESSED ITEMS BETWEEN ASCENDANT AND 
SUBMISSIVE STUDENTS AND BETWEEN HIGH AND LOW SCHOLARSHIP 
STUDENTS 

                 |                                          |    Guessed items     |             |      |           |
Type of test     | Classification of students               | Number | Actual      | Expected    | Dif  | PE_exp.p. | Dif/PE
                 |                                          |        | proportions | proportions |      |           |
True-false       | Total ascendant and submissive           |  827   | 1.000       |             |      |           |
                 | Ascendant twenty-seven per cent          |  365   |  .441       | .500        |      |           |
                 | Submissive twenty-seven per cent         |  462   |  .559†      | .500        | .059 | .012      | 4.92
Multiple-choice  | Total ascendant and submissive           | 1010   | 1.000       |             |      |           |
                 | Ascendant twenty-seven per cent          |  449   |  .444       | .500        |      |           |
                 | Submissive twenty-seven per cent         |  561   |  .556†      | .500        | .056 | .011      | 5.09
True-false       | Total upper and lower                    |  789   | 1.000       |             |      |           |
                 | Upper scholarship twenty-seven per cent  |  378   |  .479       | .500        |      |           |
                 | Lower scholarship twenty-seven per cent  |  411   |  .521       | .500        | .021 | .012      | 1.75
Multiple-choice  | Total upper and lower                    | 1007   | 1.000       |             |      |           |
                 | Upper scholarship twenty-seven per cent  |  558   |  .554†      | .500        | .054 | .011      | 4.91
                 | Lower scholarship twenty-seven per cent  |  449   |  .446       | .500        |      |           |


























From the thirty-five ascendant students the twelve having the 
lowest scholarship were chosen to compare with the twelve having the 
highest scholarship taken from the group of thirty-five submissive 
students. The comparisons of these two small groups shown in 
Table IV indicate the high price paid by students combining low 
scholarship and ascendancy for the privilege of not being required to 
answer items by guess. 
TABLE IV.—COMPARISON OF TWELVE STUDENTS COMBINING UPPER SCHOLARSHIP- 
SUBMISSION AND TWELVE STUDENTS COMBINING LOWER SCHOLARSHIP- 
ASCENDANCE WITH RESPECT TO SCORES MADE ON GUESSED ITEMS 

Type of test     | Classification of students      | Mean ± PE¹   | Dif_M        | PE_D | Dif/PE_D
True-false       | Upper scholarship-submissive    | [illegible]  | [illegible]  | .89  | 4.64
                 | Lower scholarship-ascendant     | [illegible]  |              |      |
Multiple-choice  | Upper scholarship-submissive    | [illegible]  | [illegible]  | .56  | 4.46
                 | Lower scholarship-ascendant     | [illegible]  |              |      |




















1 Because of the smallness of samples √(N − 1) was used in the denominator of 
the Chi factor. 


SUMMARY AND CONCLUSIONS 


Within the conditions which circumscribe this study the following 
conclusions seem reasonable: 

1. Do-not-guess instructions on the multiple-choice test place 
ascendant students at an advantage over submissive students. There 
is a suggestion that the same statement may hold for the true-false 
test. In general the advantage appears to be the result of a tendency 
for ascendant students to omit fewer items than submissive students, 
both groups having the ability to answer correctly about the same 
proportion of omitted items. 

2. In taking either true-false or multiple-choice tests, upper scholar- 
ship students profit more than lower scholarship students by the 
requirement that all items be answered. 

3. Do-not-guess instructions assess a very severe penalty against 
students possessing a combination of low scholarship and ascendant 
personality. 

4. The provision that items may be omitted introduces into a true- 
false test or a multiple-choice test, in part at least, a measure of per- 
sonality traits; consequently the provision reduces the validity of 
such tests if they purport to measure knowledge of subject-matter. 
In administering such tests, therefore, instructions should be given to 
respond to all items. 
A NOTE ON THE STANDARD ERROR IN THE 
CONTINGENCY MATCHING TECHNIQUE 


P. E. VERNON 


The Maudsley Hospital, London 


In a previous article¹ it was shown that the result of a matching 
experiment may be expressed as a modified mean square contingency 
coefficient and its standard or probable error. If t elements are 
matched against t other elements by n judges, so that the total number 
of matchings is nt or N, and if the proportion correct is S, then it was 
found that: 








\[ C = \sqrt{\frac{(St - 1)^2}{(St - 1)^2 + (t - 1)}}, \qquad \sigma_C = \frac{E}{\sqrt{N}} \]


where E (an arbitrary symbol) = 





(¢ — 1), (St = DIG — 1)? + 1) + AG = 1) = (St = 1) 
At — 1) + (St — 1 


The same formulae apply when unequal numbers of elements are 
matched. Thus if t′ elements are matched against t other elements, the 
only alteration is in the value of N, which becomes nt′. Some evidence 
for the applicability of these formulae was adduced from statistical 
experiments with balls; and it has since been discovered in psychological 
experiments that the dispersion of matching ability among different 
judges conforms reasonably closely to the above formula for $\sigma_C$. 

The object of this article is, first, to examine what may be termed 
“‘the SE of the material’ as distinguished from the SE of the judges, 
and secondly to consider the effects of matching more than one set of 
material upon these standard errors. 

The SE given by the above formula refers primarily to the varia- 
tions in C which might occur if another group of n similar judges 
matched the same set of t pairs of elements; it does not tell us what 
variations to expect if the same group of judges matched a different set 
of t elements. Consider the analogy of a correlation coefficient 
between an intelligence test and scholastic grades among a group of 
college students: when the PE of r is small we know that another group 








1 Vernon, P. E.: “The Evaluation of the Matching Method.” J. Educ. Psy- 
chol., Vol. XXVII, 1936, pp. 1-17. 
of students would yield much the same correlation if they took the 
same test and the same scholastic examination. But it does not follow 
that the same students would yield the same correlation should they 
take another intelligence test and a different scholastic examination. A 
statistical population of tests may not be comparable to a population of 
testees. In matching, however, the two populations—namely the 
material which is matched (the subjects’ “modes of expression”) 
and the matchings themselves (the judges’ “impressions”)—are, as it 
were, on a similar psychological plane. Hence we may tentatively infer 
that the random distribution both of judges and of subjects or material 
is the same. In other words, the same variations may be expected in 
the values of C when m different sets of material are matched by one 
judge as when one set of material is matched by n different judges. 


And the SE of the material should be given by $E/\sqrt{mt}$. 


It was not found possible to devise a statistical experiment for 
testing out this supposition, but two psychological experiments provide 
relevant data. In an investigation by Allport, Walker and Lathers,¹ 
seventy students wrote a series of English themes. They were divided 
at random into sets of five students, and eight pairs of themes by each 
student were matched (i.e. their common authorship was identified) by 
two judges. A student’s identifiability was determined by the propor- 
tion of times out of eight that his themes were correctly matched. 
Allport shows that the distribution of the 2 × 70 identifiabilities is 
fairly close to normal. Applying the chi² method, the present writer 
finds that the probability of fit of a normal curve is .85. From All- 
port’s figures it can be stated that his average judge when matching an 
average set of five pairs of themes would obtain 50.178 per cent cor- 
rect; this corresponds to C = 0.602, E = .6655. Since n = m = 1, 
and N = t = 5, the expected SE of the material is .2976. When each 
of the 2 × 70 identifiabilities is expressed as a contingency, the 
empirical SE is .3041, which coincides very closely with the predicted 
figure. Again, if the identifiabilities of successive pairs of subjects are 
combined, so that n = 1, m = 2 and N = 10, the predicted and 
obtained values of the SE are identical, namely .2105. 
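
The figures quoted for the themes can be rechecked with a short sketch. It assumes that the modified coefficient takes the form C = √[(St − 1)² / ((St − 1)² + (t − 1))], which is what a mean square contingency computed on a t × t matching table gives and which reproduces C = 0.602, and that the SE is E/√N. The value of E is taken as quoted, since its own formula is not reproduced here; the function names are hypothetical.

```python
import math


def matching_contingency(S: float, t: int) -> float:
    """Contingency coefficient for proportion correct S with t elements.

    Assumed form: C = sqrt(x^2 / (x^2 + (t - 1))), where x = S*t - 1.
    """
    x = S * t - 1.0
    return math.sqrt(x ** 2 / (x ** 2 + (t - 1)))


def se_of_contingency(E: float, N: int) -> float:
    """Standard error of C, taken as E / sqrt(N)."""
    return E / math.sqrt(N)


# English themes: S = .50178, t = 5, E = .6655 (E as quoted in the text).
print(f"C = {matching_contingency(0.50178, 5):.3f}")
print(f"SE for N = 5:  {se_of_contingency(0.6655, 5):.4f}")
print(f"SE for N = 10: {se_of_contingency(0.6655, 10):.4f}")
```

The same assumed form gives C ≈ 0.589 for S = .3189 and t = 10, the values reported for the drawings experiment described next.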

In an investigation by the present writer, four hundred ninety pairs 
of drawings were collected from elementary-school children, aged 





1 Allport, F. H., Walker, L. and Lathers, E.: “Written Composition and 
Characteristics of Personality.” Arch. of Psychol., Vol. XXVI, No. 173, 1934, 

ten to thirteen years.¹ It was hoped to secure an unselected population 
of material, i.e. a random sample of children; since the extremes were 
not represented, the age limits were intentionally made rather wide. 
Each child drew “a house” and “a man” on identical pieces of paper. 
These were shuffled and divided at random into seventy sets. Each 
set consisted of ten houses stuck onto a large sheet of paper and four 
loose men, or ten men and four houses. Half a minute was allowed 
(exclusive of the time for giving out the sheets and recording the 
answers) for matching the four men with four of the ten houses drawn 
by the same child, or four houses with four of the ten men. Thus the 
experiment consisted of 70 × 4:10 matchings. The judges were 
twenty adults of average to superior intelligence. Each matched half 
the sets, and since the sets were distributed to them at random, we can 
regard them as ten judges who matched all seventy sets. The average 
proportion correct was 31.89 per cent which corresponds to C = 0.589 
and E = .8269. We should expect the SE of an average set of material 


to be $E/\sqrt{t'} = .4134$. The empirical SE was calculated for each judge, 


and the average result was .4279. This is slightly higher than the 
predicted figure, but it is quite probable that the material was unduly 
heterogeneous. The age range of the children may have been too great, 
or the amount of concentration of the judges upon their matchings may 
have varied throughout the experiment. Thus we may claim that both 
these investigations yield satisfactory confirmation of the formula that 
we have proposed for the SE of the material. 

We must now consider: (a) The SE of one or more sets of material 
which are matched by more than one judge, and conversely, (b) the 
SE of one or more judges who match more than one set of material. 
We will call these SE’s $\sigma_m$ and $\sigma_n$ respectively, and will deal first with 
$\sigma_n$. The basic formula for the SE of judges who match m sets of similar 
materials is $\sigma_n = E/\sqrt{mN} = E/\sqrt{nmt}$. This is likely to be correct when the 
different sets of material are independent, i.e. when there is no correla- 
tion between the judges’ scores.² But if there is a perfect correlation 





1 The writer wishes to thank Dr. H. E. Field of the London Institute of Edu- 
cation for his assistance in obtaining the drawings. 

2In the writer’s previous article (cf. footnote p. 704) this formula was assumed 
to apply to the hypothetical experiments Nos. 2a, 3a and 3b, whereas the modified 
formula, given below, should have been adopted. It was, however, confirmed by 
the results of a statistical experiment, since the conditions of this experiment did 
not allow $\bar{r}_m$ to be greater than zero. 










between the several sets of material, i.e. if any one judge obtains 
the same score in every set, then $\sigma_n = E/\sqrt{nt}$. In general the average 
inter-correlation between the m sets, $\bar{r}_m$, will be small, but positive; 
$\sigma_n$ will therefore lie somewhere between the two values that we have 
cited. By applying Kelley’s formula for average inter-correlation 
(No. 171), it is found that 

\[ \sigma_n = E\sqrt{\frac{1 + \bar{r}_m(m - 1)}{nmt}} \]





This can be checked by substituting $\bar{r}_m$ = 0 or 1.0; it then reduces, 
correctly, to $E/\sqrt{nmt}$ and $E/\sqrt{nt}$. 

Empirical corroboration for this modified formula was obtained 
from the experiment with children’s drawings, already described, 
though the number of judges was much too small for conclusive results. 
The total success for each of the twenty judges was computed, and the 
empirical SE of the twenty coefficients was .1186. E = .8269, 
$\bar{r}_m$ = +0.0358, m = 35, N = nt′ = 1 × 4. Substituting, the expected 
$\sigma_n$ = .1041; this differs from the empirical value by less than its own 
SE. Fuller confirmation must await an investigation where two or 
more sets of material are matched by a large, unselected, group of 
judges. 
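
The substitution just described can be followed numerically. The sketch assumes the modified formula to be σ_n = E·√[(1 + r̄_m(m − 1)) / (nmt)], a reading which yields the quoted .1041 and the two limiting values E/√(nmt) and E/√(nt); the function and argument names are hypothetical.

```python
import math


def se_of_judges(E: float, r_m: float, m: int, n: int, t: int) -> float:
    """SE of n judges matching m sets of t elements (assumed form).

    r_m is the average inter-correlation between the m sets of material.
    """
    return E * math.sqrt((1.0 + r_m * (m - 1)) / (n * m * t))


# Drawings experiment: E = .8269, r_m = +0.0358, m = 35, n = 1, t' = 4.
print(f"expected SE = {se_of_judges(0.8269, 0.0358, 35, 1, 4):.4f}")

# Limiting cases: independent sets, then perfectly correlated sets.
print(se_of_judges(0.8269, 0.0, 35, 1, 4), 0.8269 / math.sqrt(1 * 35 * 4))
print(se_of_judges(0.8269, 1.0, 35, 1, 4), 0.8269 / math.sqrt(1 * 4))
```

Even so small an inter-correlation as +0.0358 raises the expected SE appreciably above the value for independent sets, because it is multiplied by m − 1 = 34.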
It must be realized, however, that this formula does not give a true 
prediction of $\sigma_n$; for the calculation of $\bar{r}_m$ is itself based upon the SE’s 
of the judges’ scores. In other words, the probable variations in C 
(when more than one set of material is matched by more than one 
judge) can only be determined on the basis of the observed variations 
within the available experimental results. Thus in actual practice it 
will probably be simpler not to attempt to give $\sigma_n$ for all the sets of 
material, but only for the average set; since the latter can be directly 
predicted by the formula $E/\sqrt{nt}$. 

The same reasoning applies to $\sigma_m$, the SE of the material. If 
$\bar{r}_n$ is the average inter-correlation of the judges in a population of 
material (i.e. the correlation between the ease or difficulty of all 
[…]), then 

\[ \sigma_m = E\sqrt{\frac{1 + \bar{r}_n(n - 1)}{nmt}} \]

which becomes $E/\sqrt{nmt}$ if the judges are independent, and $E/\sqrt{mt}$ if the correla- 
tion is perfect. Allport’s investigation of English themes provides 
a good illustration of the formula. The number of judges was two, and 
the correlation between their seventy scores was +0.28. Since 
E = .6655, and t = 5, we should expect the SE of the material to be 

\[ \frac{.6655\sqrt{1 + 0.28}}{\sqrt{2 \times 1 \times 5}} = .2381. \]

When the actual results of the combined 
judges for each of the seventy subjects are expressed as contingencies, 
the empirical SE is .2343, an almost identical figure. 

In the present writer’s investigation of drawings, the average inter- 
correlation of the ten judges within the seventy sets of material was 
+0.1688; and as E = .8269, n = 10, m = 1, t′ = 4, the predicted $\sigma_m$ for 
one set = .2075. The empirical SE of the seventy results of the com- 
bined judges was .2254. The agreement here is not quite so striking, 
probably because of the undue heterogeneity of the material. But the 
difference between the two figures is hardly significant, since the SE 
of the empirical $\sigma_m = \sigma_m/\sqrt{2 \times 70} = .0190$. 

Again, however, our prediction involves a correlation which is itself 
dependent on the observed variations within the given material. 
As there does not seem to be any way of predicting the SE of material 
(whether of one set or more than one) which is matched by more than 
one judge, the best practical procedure will be to quote $\sigma_m$ for all the 
sets matched by the average judge, i.e. $E/\sqrt{mt}$. 


SUMMARY 


The contingency coefficient which expresses the validity of a match- 
ing experiment possesses two standard or probable errors, one referring 
to variations in the judges who match the material, the other to varia- 
tions in the subjects or sets of material which are matched. The SE 
of n judges who match one set of material, or of m sets of material 
matched by one judge, can be predicted by the same formula; and 
this formula is empirically confirmed by the results of large-scale 
experiments. But the SE of judges who match more than one set, or of 
material matched by more than one judge, can only be determined on 


the basis of observed variations within the available experimental 
results (except in the hypothetical case where there is no inter-correla- 
tion between the sets, or between the judges, respectively). 

There is a further corollary to the above discussion: In order to 
obtain a reliable matching contingency coefficient, not only should the 
group of judges, but also the number of sets of material, be fairly 
large. Both nt and mt should, like N in a correlational experiment, 
amount to one hundred or more. 
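
As a check on the material-side figures quoted earlier, the following sketch recomputes the predicted SE of the material for the themes and the drawings and sets the discrepancy for the drawings against the SE of the empirical value. It assumes the form σ_m = E·√[(1 + r̄_n(n − 1)) / (nmt)], which agrees with the worked expression for the themes, and the standard approximation that the SE of a standard deviation based on k cases is the standard deviation divided by √(2k). The function names are hypothetical.

```python
import math


def se_of_material(E: float, r_n: float, n: int, m: int, t: int) -> float:
    """SE of m sets of material matched by n judges (assumed form).

    r_n is the average inter-correlation of the judges.
    """
    return E * math.sqrt((1.0 + r_n * (n - 1)) / (n * m * t))


def se_of_sd(sd: float, cases: int) -> float:
    """Approximate standard error of a standard deviation from `cases` values."""
    return sd / math.sqrt(2 * cases)


themes = se_of_material(0.6655, 0.28, n=2, m=1, t=5)
drawings = se_of_material(0.8269, 0.1688, n=10, m=1, t=4)
print(f"themes:   predicted {themes:.4f}, empirical .2343")
print(f"drawings: predicted {drawings:.4f}, empirical .2254")

# Is the discrepancy for the drawings within one SE of the empirical value?
gap = 0.2254 - drawings
print(f"gap = {gap:.4f}, SE of empirical value = {se_of_sd(0.2254, 70):.4f}")
```

On these assumptions the gap for the drawings is a little under one standard error of the empirical figure, in keeping with the statement that the difference is hardly significant.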





THERE IS NO EDUCATIONAL PSYCHOLOGY 


P. F. VALENTINE 


San Francisco State College 


While yet a tender educationist, I discovered the S-R bond. At 
about the same time, I came into possession of the intelligence quotient, 
the educational age, the accomplishment quotient, the standard 
score, the curve of normal distribution, the standard deviation, and a 
number of other pegs upon which to drape my thinking. These 
acquirements were a great thrill—and a marvelous convenience. The 
valiant confusions of my amateur mind shifted like magic into an 
ordered, practicable system. My enthusiasm for the new enlighten- 
ment was only matched by my astonishment at the architectural 
simplicity of the whole business. I soon got everything figured out. 
“It works,” I exclaimed pragmatically—“therefore, it is true!” 

Let me take the IQ’s, “stamp in” the S-R bonds, measure the 
products, and manipulate the data, and I had it all. But the funda- 
mental thing was the bonds. They were the framework for the whole 
structure of learning; and unless you understood just how they went 
together, you were an uneducated educator. The idea was not hard to 
grasp, however, in an elementary fashion. The way I got it was to 
realize that the environment is full of a great number of stimuli to which 
we must learn to respond. The afferent nerves pick up the stimuli, 
and the efferent nerves carry the impulse to the structures that perform 
the reaction. The proper hook-up, once made, is strengthened by 
exercise. That establishes a neurone pathway all set to perform the 
trick whenever necessary. This reasonable and practicable scheme 
was, as I gratefully acknowledged, psychology’s gift to man. 

So concise a mastery formula proved not beyond the capacity 
of great numbers of other educators, as I soon learned. Indeed, I 
presently found that teachers of courses in education everywhere were 
forming the custom of devoting their first lecture to bonds. Thus a 
good foundation of educational psychology was assured, and the knowl- 
edge was spread throughout the profession. Usually, it went so far as 
to include the “laws of learning.” 

In consequence of my inquiring and studious disposition, I went 
much further into the subject. I learned how the first happy con- 
nection is made through trial and error, and how the mathematics of 


frequency gives advantage to the right response. I got on intimate 
terms with the synapse and the reflex arc, and clearly grasped the 
likeness of our nervous equipment to a metropolitan telephone system. 
In due time I became acquainted with the salivary dog, and with baby 
Albert and his white rabbit, and from then on the conditioned response 
was an unfailing guide. In regard to some problems, it is true, I had a 
little trouble. How all the bonds got organized into complex behavior 
was a bit difficult; and I felt some confusion over such minor matters 
as purpose, will, reasoning, and insight. But I was careful not to 
betray my weakness in regard to these questions, and I knew that 
sooner or later I would worm the answers out of the authorities. 

One of the delightful features of my progress was the acquirement of 
a jargon that put me upon a footing of camaraderie with other intel- 
lectuals in the profession. Our common language created an 
atmosphere of esoteric fellowship and encouraged a healthy expansion 
of personality. Outsiders were amazed when the conversation began 
to scintillate with neurones, chain reflexes, substitute stimuli, associa- 
tive shiftings, common elements, spinal reactions, simultaneous associa- 
tions and modifiable connections. And wonder shone upon us when, in 
the ease of our sure knowledge, we tossed off the problems of learning 
with the law of this and the law of that, prefaced by the mystic phrase, 
“other things being equal.” 

Our masterful grasp of things, in those pre-integration days, rested 
fundamentally upon an educational psychology that we could visualize. 
Afferent and efferent pathways, synapses, connections through the 
cord, cortical associations, were realities that could be projected in the 
imagination in a linear arrangement. The phenomena of learning 
could be graphically conceived, like problems in descriptive geometry; 
and complex behavior was but a matter of coördinates. This Euclidian 
design was completely satisfying. It was comprehensive and beautiful. 
But above all, it was amazingly convenient. At any rate, it would 
have been terribly inconvenient to “unlearn” it. Much easier was it to 
fit the perplexities of perception, instinct, insight, organization and 
transfer into the system than to suffer the system to be changed. For 
our educational psychology was the system, and the system was our 
educational psychology. 

A Cebu monkey robbed me of my bonds, and a cage of rats ate up 
my faith in specificity. For the monkey learned a trick with his right 
arm while his left was paralyzed; and when the paralysis was trans- 
ferred to the right, he did the trick with his left! Just like that. And 
the rats—Lashley’s rats—showed no respect whatever for reflex 
arcs. They stubbornly refused to identify their learnings with cortical 
traces. And there were also those troublesome chimpanzees. I had 
hoped to forget them, and had sought refuge among those who skepti- 
cally inferred that there was something wrong on the island of Teneriffe. 
But the banana-seeking detours of those hungry animals obsessed me. 
Verily, a plague of rodents and simians had fallen upon my cherished 
S-R foundations! 

And then I came under the influence of alien propaganda, financed 
from Berlin. Gestalttheorie. I should have been warned against these 
foreign-made ideas. But it was too late. Perhaps the blood of my 
sturdy American ancestors had gotten thin in my veins, for I began to 
succumb to the radical Teutonic literature. Finally my young mind 
was seduced by certain college professors who were teaching thinly 
veiled versions of the subversive doctrine to American youth. My fall 
became complete, and I joined a cell. 

Having burned my bonds behind me, I pressed forward in search of 
more knowledge. For I was eager to close the gaps in my new system. 
I wanted a complete mental picture of it, so that I could envisage what 
went on. But, alas, the further I went, the further I got from any 
system at all. Or if it was a system, it was one in which I could never 
hope to get things systematized. For I discovered that the living 
body is the scene of a bewildering game of organismic dynamics, played 
by aggregations of energy that deploy along metabolic gradients, surge 
among areas of changing potential, and maneuver in and out of incon- 
ceivable patterns. I found that learnings are the offspring of psychic 
miracles called insights; intelligence an intuitive capacity to grasp the 
relationships of figure in a ground; and behavior a mélange of field 
properties as elusive as flickers of light through a wind-stirred tree. 
This the price for the neat, compact, decipherable system that once was 
mine! 

These meditations bring me to the dismal conclusion that the 
psychologist in education is not the deus ex machina that he used to be. 
That exalted rôle, it would seem, has flitted away with our S-R entities, 
and disappeared in an organismic haze. The psychologist will have to 
content himself with the humble part of experimenter, pursuing and 
compiling disparate facts and findings. But, alas, the handy frame- 
work, the infallible abacus, is shattered. The all-inclusive design is no 
more. Things are at odds. There is no educational psychology! 